Abstract: Eventual consistency aims to ensure that replicas of some mutable shared
object converge without foreground synchronisation. Previous approaches to eventual
consistency are ad-hoc and error-prone. We study a principled approach: to base the design of
shared data types on some simple formal conditions that are sufficient to guarantee
eventual consistency. We call these types Convergent or Commutative Replicated Data Types
(CRDTs). This paper formalises asynchronous object replication, either state based or
operation based, and provides a sufficient condition appropriate for each case. It describes
several useful CRDTs, including container data types supporting both add and remove
operations with clean semantics, and more complex types such as graphs, monotonic DAGs,
and sequences. It discusses some properties needed to implement non-trivial CRDTs.
Keywords: Data replication, optimistic replication, commutative operations
1 Introduction

Replication is a fundamental concept of distributed systems, well studied by the distributed 
algorithms community. Much work focuses on maintaining a global total order of operations 
[24] even in the presence of faults [8]. However, the associated serialisation bottleneck 
negatively impacts performance and scalability, while the CAP theorem [13] imposes a
trade-off between consistency and partition-tolerance.

An alternative approach, eventual consistency or optimistic replication, is attractive to 
practitioners [37, 41]. A replica may execute an operation without synchronising a priori with
other replicas. The operation is sent asynchronously to other replicas; every replica
eventually applies all updates, possibly in different orders. A background consensus algorithm
reconciles any conflicting updates [4, 40]. This approach ensures that data remains available 
despite network partitions. It performs well (as the consensus bottleneck has been moved 
off the critical path), and the weaker consistency is considered acceptable for some classes 
of applications. However, reconciliation is generally complex. There is little theoretical 
guidance on how to design a correct optimistic system, and ad-hoc approaches have proven 
brittle and error-prone.¹

In this paper, we study a simple, theoretically sound approach to eventual consistency. 
We propose the concept of a convergent or commutative replicated data type (CRDT), for 
which some simple mathematical properties ensure eventual consistency. A trivial example 
of a CRDT is a replicated counter, which converges because the increment and decrement 
operations commute (assuming no overflow). Provably, replicas of any CRDT converge 
to a common state that is equivalent to some correct sequential execution. As a CRDT 
requires no synchronisation, an update executes immediately, unaffected by network latency, 
faults, or disconnection. It is extremely scalable and is fault-tolerant, and does not require 
much mechanism. Application areas may include computation in delay-tolerant networks, 
latency tolerance in wide-area networks, disconnected operation, churn-tolerant peer-to-peer 
computing, data aggregation, and partition-tolerant cloud computing. 

Since, by design, a CRDT does not use consensus, the approach has strong limitations; 
nonetheless, some interesting and non-trivial CRDTs are known to exist. For instance, we 
previously published Treedoc, a sequence CRDT designed for co-operative text editing [32]. 

Previously, only a handful of CRDTs were known. The objective of this paper is to push 
the envelope, studying the principles of CRDTs, and presenting a comprehensive portfolio of 
useful CRDT designs, including variations on registers, counters, sets, graphs, and sequences. 
We expect them to be of interest to practitioners and theoreticians alike. 

Some of our designs suffer from unbounded growth; collecting the garbage requires a
weak form of synchronisation [25]. However, its liveness is not essential, as it is an
optimisation, off the critical path, and not in the public interface. In the future, we plan to extend
the approach to data types where common-case, time-critical operations are commutative,
and rare operations require synchronisation but can be delayed to periods when the network
is well connected. This concurs with Brewer’s suggestion for side-stepping the CAP
impossibility [6]. It is also similar to the shopping cart design of Alvaro et al. [1], where updates
commute, but check-out requires coordination. However, this extension is out of the scope
of the present study.

¹ The anomalies of the Amazon Shopping Cart are a well-known example [10].

In the literature, the preferred consistency criterion is linearisability [18]. However,
linearisability requires consensus in general. Therefore, we settle for the much weaker
quiescent consistency [17, Section 3.3]. One challenge is to minimise “anomalies,” i.e., states
that would not be observed in a sequential execution. Note also that CRDTs are weaker
than non-blocking constructs, which are generally based on a hardware consensus primitive
[17].

Some of the ideas presented in this paper are already known in the folklore. The
contributions of this paper include:

• In Section 2: (i) A specification language suited to asynchronous replication, (ii) A
formalisation of state-based and operation-based replication, (iii) Two sufficient
conditions for eventual consistency.

• In Section 3, a comprehensive collection of useful data type designs, starting with
counters and registers. We focus on container types (sets and maps) supporting both
add and remove operations with clean semantics, and more complex derived types,
such as graphs, monotonic DAGs, and sequences.

• In Section 4, a study of the problem of garbage-collecting meta-data. 

• In Section 5, exercising some of our CRDTs in a practical example, the shopping cart. 

• A comparison with previous work, in Section 6. 

Section 7 concludes with a summary of lessons learned, and perspectives for future work. 


2 Background and system model 

We consider a distributed system consisting of processes interconnected by an asynchronous 
network. The network can partition and recover, and nodes can operate in disconnected 
mode for some time. A process may crash and recover; its memory survives crashes. We 
assume non-byzantine behaviour. 

2.1 Atoms and objects 

A process may store atoms and objects. An atom is a base immutable data type, identified 
by its literal content. Atoms can be copied between processes; atoms are equal if they have 
the same content. Atom types considered in this paper include integers, strings, sets, tuples,
etc., with their usual non-mutating operations. Atom types are written in lower case, e.g.,
“set.”

Figure 1: Object

An object is a mutable, replicated data type. Object types are capitalised, e.g., “Set.” 
An object has an identity, a content (called its payload), which may be any number of atoms 
or objects, an initial state, and an interface consisting of operations. Two objects having 
the same identity but located in different processes are called replicas of one another. As 
an example, Figure 1 depicts a logical object x, its replicas at processes 1,2 and 3, and the 
current state of the payload of replica 3. 

We assume that objects are independent and do not consider transactions. Therefore, 
without loss of generality, we focus on a single object at a time, and use the words process 
and replica interchangeably. 

2.2 Operations 

The environment consists of unspecified clients that query and modify object state by calling 
operations in its interface, against a replica of their choice called the source replica. A query 
executes locally, i.e., entirely at one replica. An update has two phases: first, the client calls 
the operation at the source, which may perform some initial processing. Then, the update 
is transmitted asynchronously to all replicas; this is the downstream part. The literature 
[37] distinguishes the state-based and operation-based (op-based for short) styles, explained 
next. 
Specification 1 Outline of a state-based object specification. Preconditions, arguments, return
values and statements are optional.

payload Payload type; instantiated at all replicas
    initial Initial value
query Query (arguments) : returns
    pre Precondition
    let Evaluate synchronously, no side effects
update Source-local operation (arguments) : returns
    pre Precondition
    let Evaluate at source, synchronously
    Side-effects at source to execute synchronously
compare (value1, value2) : boolean b
    Is value1 $\leq_v$ value2 in semilattice?
merge (value1, value2) : payload mergedValue
    LUB merge of value1 and value2, at any replica





Figure 4: State-based replication 





Figure 5: Example CvRDT: integer + max 


INRIA 















A comprehensive study of CRDTs 


2.2.1 State-based replication 

In state-based (or passive) replication, an update occurs entirely at the source, then
propagates by transmitting the modified payload between replicas, as illustrated in Figure 4.

We specify state-based object types as shown in Specification 1. Keyword payload
indicates the payload type, and initial specifies its initial value at every replica. Keyword update
indicates an update operation, and query a query. Both may have (optional) arguments 
and return values. Non-mutating statements are marked let, and payload is mutated by 
assignment :=. An operation executes atomically. 

To capture safety, an operation is enabled only if a given source pre-condition (marked 
pre in a specification) holds in the source’s current state. The source pre-condition is omitted 
if always enabled, e.g., incrementing or decrementing a Counter. Conversely, non-null
preconditions may be necessary, for instance an element can be removed from a Set only if it
is in the Set at the source. 

The system transmits state between arbitrary pairs of replicas, in order to propagate 
changes. This updates the payload of the receiver with the output of operation merge,
invoked with two arguments, the local payload state and the received state. Operation 
compare compares replica states, as will be explained shortly. 

We define the causal history [38] $C$ of replicas of some object $x$ as follows:²

Definition 2.1 (Causal History, state-based). For any replica $x_i$ of $x$:

• Initially, $C(x_i) = \emptyset$.

• After executing update operation $f$, $C(f(x_i)) = C(x_i) \cup \{f\}$.

• After executing merge against states $x_i, x_j$, $C(merge(x_i, x_j)) = C(x_i) \cup C(x_j)$.

The classical happens-before [24] relation between operations can be defined as
$f \to g \Leftrightarrow C(f) \subset C(g)$.

Liveness requires that any update eventually reaches the causal history of every replica. 
To this effect, we assume an underlying system that transmits states between pairs of replicas 
at unspecified times, infinitely often, and that replica communication forms a connected 
graph. 

2.2.2 Operation-based (op-based) objects 

In operation-based (or active) replication, the system transmits operations, as illustrated in 
Figure 6. This style is specified as outlined in Spec. 2. The payload and initial clauses are 
identical to the state-based specifications. An operation that does not mutate the state is 
marked query and executes entirely at a single replica. 

² C is a logical function; it is not part of the object.

Specification 2 Outline of operation-based object specification. Preconditions, return values
and statements are optional.

payload Payload type; instantiated at all replicas
    initial Initial value
query Source-local operation (arguments) : returns
    pre Precondition
    let Execute at source, synchronously, no side effects
update Global update (arguments) : returns
    atSource (arguments) : returns
        pre Precondition at source
        let 1st phase: synchronous, at source, no side effects
    downstream (arguments passed downstream)
        pre Precondition against downstream state
        2nd phase, asynchronous, side-effects to downstream state

Figure 6: Operation-Based Replication

An update is specified by keyword update. Its first phase, marked atSource, is local to
the source replica. It is enabled only if its (optional) source pre-condition, marked pre, is
true in the source state; it executes atomically. It takes its arguments from the operation
invocation; it is not allowed to make side effects; it may compute results, returned to the
caller, and/or prepare arguments for the second phase.

The second phase, marked downstream, executes after the source-local phase:
immediately at the source, and asynchronously at all other replicas; it cannot return results. It
executes only if its downstream precondition is true. It updates the downstream state; its
arguments are those prepared by the source-local phase. It executes atomically.

As above, we define the causal history of a replica $C(x_i)$.

Definition 2.2 (Causal History, op-based). The causal history of a replica $x_i$ is defined
as follows.

• Initially, $C(x_i) = \emptyset$.

• After executing the downstream phase of operation $f$ at replica $x_i$, $C(f(x_i)) = C(x_i) \cup \{f\}$.

Liveness requires that every update eventually reaches the causal history of every replica.
To this effect, we assume an underlying reliable broadcast that delivers every update
to every replica in an order $<_d$ (called delivery order) in which the downstream precondition
is true.

As in the state-based case, the happens-before relation between operations is defined
by $f \to g \Leftrightarrow C(f) \subset C(g)$. We define causal delivery $<_{\to}$ as follows: if $f \to g$ then $f$ is
delivered before $g$ is delivered. We note that all downstream preconditions in this paper
are satisfied by causal delivery, i.e., delivery order is the same as or weaker than causal order:
$f <_d g \Rightarrow f <_{\to} g$.

2.3 Convergence 

We now formalise convergence. 

Definition 2.3 (Eventual Convergence). Two replicas $x_i$ and $x_j$ of an object $x$ converge
eventually if the following conditions are met:

• Safety: $\forall i, j : C(x_i) = C(x_j)$ implies that the abstract states of $i$ and $j$ are equivalent.

• Liveness: $\forall i, j : f \in C(x_i)$ implies that, eventually, $f \in C(x_j)$.

Furthermore, we define state equivalence as follows: $x_i$ and $x_j$ have equivalent abstract
state if all query operations return the same values.

Pairwise eventual convergence implies that any non-empty subset of replicas of the object 
converge, as long as all replicas receive all updates. 
2.3.1 State-based CRDT: Convergent Replicated Data Type (CvRDT)

A join semilattice [9] (or just semilattice hereafter) is a partial order $\leq_v$ equipped with a
least upper bound (LUB) $\sqcup_v$, defined as follows:

Definition 2.4 (Least Upper Bound (LUB)). $m = x \sqcup_v y$ is a Least Upper Bound of $\{x, y\}$
under $\leq_v$ iff $x \leq_v m$ and $y \leq_v m$ and there is no $m' <_v m$ such that $x \leq_v m'$ and $y \leq_v m'$.

It follows from the definition that $\sqcup_v$ is: commutative: $x \sqcup_v y =_v y \sqcup_v x$; idempotent:
$x \sqcup_v x =_v x$; and associative: $(x \sqcup_v y) \sqcup_v z =_v x \sqcup_v (y \sqcup_v z)$.

Definition 2.5 (Join Semilattice). An ordered set $(S, \leq_v)$ is a Join Semilattice iff
$\forall x, y \in S$, $x \sqcup_v y$ exists.
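
As a quick illustration of these properties, the following Python sketch checks them on a small sample of states for one familiar semilattice: finite sets ordered by inclusion, with union as the LUB. The names `join`, `leq` and `samples` are illustrative assumptions, not part of the paper's notation, and a check over a handful of samples is of course not a proof.

```python
from itertools import product


def join(x: frozenset, y: frozenset) -> frozenset:
    """LUB of two sets under the subset order: their union."""
    return x | y


def leq(x: frozenset, y: frozenset) -> bool:
    """Partial order: x is below y iff x is a subset of y."""
    return x <= y


# A small, arbitrary sample of states on which to check the properties.
samples = [frozenset(s) for s in ([], [1], [2], [1, 2], [2, 3])]

commutative = all(join(x, y) == join(y, x)
                  for x, y in product(samples, repeat=2))
idempotent = all(join(x, x) == x for x in samples)
associative = all(join(join(x, y), z) == join(x, join(y, z))
                  for x, y, z in product(samples, repeat=3))
# The join is an upper bound of both of its arguments.
upper_bound = all(leq(x, join(x, y)) and leq(y, join(x, y))
                  for x, y in product(samples, repeat=2))
```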

A state-based object whose payload takes its values in a semilattice, and where
$merge(x, y) \stackrel{\mathrm{def}}{=} x \sqcup_v y$, converges towards the LUB of the initial and updated values. If, furthermore, updates
monotonically advance upwards according to $\leq_v$ (i.e., the payload value after an update is
greater than or equal to the one before), then it converges towards the LUB of the most
recent values. Let us call this combination “monotonic semilattice.”

A type with these properties will be called a Convergent Replicated Data Type or
CvRDT. We require that, in a CvRDT, compare(x, y) return $x \leq_v y$, that abstract
states be equivalent if $x \leq_v y \wedge y \leq_v x$, and that merge be always enabled. As an example,
Figure 5 illustrates a CvRDT with integer payload, where $\leq_v$ is integer order, and where
merge() = max().

Eventual convergence requires that all replicas receive all updates. The communication 
channels of a CvRDT may have very weak properties. Since merge is idempotent and
commutative (by the properties of $\sqcup_v$), messages may be lost, received out of order, or multiple
times, as long as new state eventually reaches all replicas, either directly or indirectly via 
successive merges. Updates are propagated reliably even if the network partitions, as long 
as eventually connectivity is restored. 
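
The tolerance to reordering and duplication can be sketched with the integer-and-max example of Figure 5. The class `MaxInt` and the delivery schedule in `simulate` are hypothetical, chosen only to show a stale message, a duplicate, and an indirect propagation path; they are not taken from the paper.

```python
class MaxInt:
    """State-based object with an integer payload and merge = max."""

    def __init__(self) -> None:
        self.value = 0

    def update(self, n: int) -> None:
        # Monotonic update: the payload never decreases.
        self.value = max(self.value, n)

    def compare(self, other: "MaxInt") -> bool:
        return self.value <= other.value

    def merge(self, received: int) -> None:
        self.value = max(self.value, received)


def simulate() -> list:
    replicas = [MaxInt() for _ in range(3)]
    for replica, n in zip(replicas, (4, 7, 2)):
        replica.update(n)
    # Snapshots of each payload, as if sent over the network.
    snap = [r.value for r in replicas]
    replicas[0].merge(snap[2])            # a stale state arrives first
    replicas[0].merge(snap[1])
    replicas[0].merge(snap[1])            # duplicate delivery is harmless
    replicas[2].merge(snap[0])            # old snapshot of replica 0
    replicas[2].merge(replicas[0].value)  # 7 reaches replica 2 indirectly
    replicas[1].merge(snap[2])
    return [r.value for r in replicas]
```

Under this schedule every replica ends at the LUB of the three updates, even though replica 2 never hears from replica 1 directly.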

Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the 
system transmits payload infinitely often between pairs of replicas over eventually-reliable 
point-to-point channels. 

Proof. Any two replicas $x_i, x_j$ will converge, as long as they can exchange states by some
(direct or indirect) channel that eventually delivers, by merging their states. Since CvRDT
values form a monotonic semilattice, merge is always enabled, and one can make $x_i' :=
merge(x_i, x_j)$ and $x_j' := merge(x_j, x_i)$. By Definition 2.1, we have the same causal history
in $x_i'$ and $x_j'$, since $C(x_i) \cup C(x_j) = C(x_j) \cup C(x_i)$. Finally we have equivalent abstract states
$x_i' =_v x_j'$ since, by commutativity of LUB, $x_i \sqcup_v x_j =_v x_j \sqcup_v x_i$. □
2.3.2 Operation-based CRDT: Commutative Replicated Data Type (CmRDT)

In an op-based object, a reliable broadcast channel guarantees that all updates are delivered
at every replica, in the delivery order $<_d$ specified by the data type. Operations not ordered
by $<_d$ are said to be concurrent; formally $f \parallel_d g \Leftrightarrow f \not<_d g \wedge g \not<_d f$. If all concurrent operations
commute, then all execution orders consistent with delivery order are equivalent, and all
replicas converge to the same state. Such an object is called a Commutative Replicated
Data Type (CmRDT).

As noted earlier, for all data types studied here, causal delivery $<_{\to}$ (which is readily
implementable in static distributed systems and does not require consensus) satisfies delivery
order $<_d$. For some data types, a weaker ordering suffices, but then more pairs of operations
need to be proved commutative. 

Definition 2.6 (Commutativity). Operations $f$ and $g$ commute iff, for any reachable replica
state $S$ where their source pre-condition is enabled, the source precondition of $f$ (resp. $g$)
remains enabled in state $S \cdot g$ (resp. $S \cdot f$), and $S \cdot f \cdot g$ and $S \cdot g \cdot f$ are equivalent abstract
states.
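
A minimal Python sketch of the state-equivalence part of this definition follows. It assumes that all operations are always enabled, so the precondition clause is ignored, and it models an operation as a plain function from state to state. The helper `commute` and the naive set operations `add` and `remove` are illustrative assumptions rather than designs from this paper.

```python
def increment(i: int) -> int:
    return i + 1


def decrement(i: int) -> int:
    return i - 1


def add(e):
    """Naive set insertion, returned as a state-to-state function."""
    return lambda s: s | {e}


def remove(e):
    """Naive set removal, returned as a state-to-state function."""
    return lambda s: s - {e}


def commute(f, g, state) -> bool:
    """True if applying f then g yields the same state as g then f."""
    return g(f(state)) == f(g(state))


# Counter operations commute from every sampled state.
counter_ok = all(commute(increment, decrement, s) for s in range(-3, 4))
# Naive add and remove of the same element do not commute.
set_ok = commute(add("a"), remove("a"), frozenset())
```

The second check fails because the final state depends on which of the two operations is applied last, which is why container types need more careful designs.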

Proposition 2.2. Any two replicas of a CmRDT eventually converge under reliable
broadcast channels that deliver operations in delivery order $<_d$.

Proof. Consider object replicas $x_i, x_j$. Under the channel assumptions, eventually the two
replicas will deliver the same operations (if no new operations are generated), and we have
$C(x_i) = C(x_j)$. For any two operations $f, g$ in $C(x_i)$: (1) if they are not causally related then
they are concurrent under $<_d$ (which is never stronger than causality) and must commute;
(2) if they are causally related, $f \to g$, but are not ordered in delivery order $<_d$, they must
also commute; (3) if they are causally related, $f \to g$, and delivered in that order, $f <_d g$, then
they are applied in that same order everywhere. In all cases an equivalent abstract state is
reached in all replicas. □

Recall that reliable causal delivery does not require agreement. It is immune to
partitioning, in the sense that replicas in a connected subset can deliver each other’s updates,
and that updates are eventually delivered to all replicas. As delivery order is never stricter 
than causal delivery, a fortiori this is true of all CmRDTs. 

2.4 Relation between the two approaches 

We have shown two approaches to eventual convergence, CvRDTs and CmRDTs, which
together we call CRDTs. There are similarities and differences between the two. 

State-based mechanisms (CvRDTs) are simple to reason about, since all necessary
information is captured by the state. They require weak channel assumptions, allowing for
unknown numbers of replicas. However, sending state may be inefficient for large objects;
this can be tackled by shipping deltas, but this requires mechanisms similar to the op-based
approach. Historically, the state-based approach is used in file systems such as NFS, AFS
[19], Coda [22], and in key-value stores such as Dynamo [10] and Riak.

Specification 3 Operation-based emulation of state-based object

payload State-based S                ▷ S: Emulated state-based object
    initial Initial payload
update State-based-update (operation f, args a) : state s
    atSource (f, a) : s
        pre S.f.precondition(a)
        let s = S.f(a)               ▷ Compute state applying f to S
    downstream (s)
        S := merge(S, s)

Specifying operation-based objects (CmRDTs) can be more complex since it requires 
reasoning about history, but conversely they have greater expressive power. The payload 
can be simpler since some state is effectively offloaded to the channel. Op-based replication 
is more demanding of the channel, since it requires reliable broadcast, which in general 
requires tracking group membership. Historically, op-based approaches have been used in 
cooperative systems such as Bayou [31], Rover [21], IceCube [33], and Telex [4].


2.4.1 Operation-based emulation of a state-based object 

Interestingly, it is always possible to emulate a state-based object using the operation-based 
approach, and vice versa.³

In Spec. 3 we show operation-based emulation of a state-based object (taking some 
liberties with notation). Ignoring queries (which pose no problems), the emulating operation- 
based object has a single update that computes some state-based update (after checking for 
its precondition) and performs merge downstream. The downstream precondition is empty 
because merge must be enabled in any reachable state. The emulation does not make use of 
compare. 

Note that if the base object is a CvRDT, then merge operations commute, and the 
emulated object is a CmRDT. 
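
The emulation can be sketched in Python as follows. The base object `MaxRegister` (an integer with merge = max) and the wrapper `OpEmulation` are hypothetical names introduced for illustration; the wrapper follows the at-source and downstream phases of Spec. 3 under the assumption that the base update has no precondition.

```python
class MaxRegister:
    """Hypothetical state-based base object: int payload, merge = max."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def raise_to(self, n: int) -> "MaxRegister":
        # State-based update, returning the new state.
        return MaxRegister(max(self.value, n))

    def merge(self, other: "MaxRegister") -> "MaxRegister":
        return MaxRegister(max(self.value, other.value))


class OpEmulation:
    """Op-based wrapper around a state-based object."""

    def __init__(self) -> None:
        self.S = MaxRegister()

    def at_source(self, method: str, *args) -> MaxRegister:
        # First phase: compute s = S.f(a) with no side effects.
        return getattr(self.S, method)(*args)

    def downstream(self, s: MaxRegister) -> None:
        # Second phase, always enabled: S := merge(S, s).
        self.S = self.S.merge(s)


a, b = OpEmulation(), OpEmulation()
s1 = a.at_source("raise_to", 5)
s2 = b.at_source("raise_to", 3)
# Every update is delivered to every replica, in different orders.
for s in (s1, s2):
    a.downstream(s)
for s in (s2, s1):
    b.downstream(s)
```

Both replicas end with payload 5 regardless of delivery order, because the only downstream effect is a merge.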


2.4.2 State-based emulation of an operation-based object 

State-based emulation of an operation-based object essentially formalises the mechanics of 
an epidemic reliable broadcast, as shown in Spec. 4 (taking some liberties with notation). 
Again, we ignore queries, which pose no problems. Calling an operation-based update adds 
it to a set M of messages to be delivered; merge takes the union of the two message sets.

³ Contrary to what Helland says [16], because he only considers read-write state, not a merge operation.
Specification 4 State-based emulation of operation-based object

payload Operation-based P, set M, set D      ▷ Payload of emulated object, messages, delivered
    initial Initial state of payload, ∅, ∅
update op-based-update (update f, args a) : returns
    pre P.f.atSource.pre(a)                  ▷ Check at-source precondition
    let returns = P.f.atSource(a)            ▷ Perform at-source computation
    let u = unique()
    M := M ∪ {(f, a, u)}                     ▷ Send unique operation
    deliver()                                ▷ Deliver to local op-based object
update deliver ()
    for (f, a, u) ∈ (M \ D) : f.downstream.pre(a) do
        P := P.f.downstream(a)               ▷ Apply downstream update to replica
        D := D ∪ {(f, a, u)}                 ▷ Remember delivery
compare (R, R') : boolean b
    let b = R.M ≤ R'.M ∨ R.D ≤ R'.D
merge (R, R') : payload R''
    let R''.M = R.M ∪ R'.M
    R''.deliver()                            ▷ Deliver pending enabled updates


Specification 5 op-based Counter

payload integer i
    initial 0
query value () : integer j
    let j = i
update increment ()
    downstream ()                ▷ No precond: delivery order is empty
        i := i + 1
update decrement ()
    downstream ()                ▷ No precond: delivery order is empty
        i := i − 1




When an update’s downstream precondition is true, the corresponding message is delivered 
by executing the downstream part of the update. In order to avoid duplicate deliveries, 
delivered messages are stored in a set D. 

Note that the states of the emulating object form a monotonic semilattice. Calling or 
delivering an operation adds it to the relevant message set, and therefore advances the state
in the partial order. Operation merge is defined to take the union of the M sets, and is thus a LUB
operation. Remark that M is identical to the causal history of the replica; non-concurrent 
updates appear in M in causal order. If the emulated op-based object is a CmRDT, then 
delivery order is satisfied. Concurrent operations appear in M in any order; if the emulated 
object is a CmRDT, they commute. Therefore, after two replicas merge mutually, their D 
sets are identical and their P payloads have equivalent state. 
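
A compact Python sketch of this emulation, specialised to the op-based counter, is given below. The class name `StateEmulation`, the global id generator `_ids` (standing in for the unique() call of Spec. 4), and the string encoding of operations are assumptions made for illustration. Since the counter has no downstream precondition, every pending message is treated as enabled.

```python
import itertools

_ids = itertools.count()  # hypothetical source of unique message tags


class StateEmulation:
    """State-based wrapper around an op-based counter."""

    def __init__(self) -> None:
        self.P = 0      # payload of the emulated op-based counter
        self.M = set()  # messages: (operation name, unique id)
        self.D = set()  # messages already delivered locally

    def update(self, op: str) -> None:
        # The unique tag distinguishes repeated identical operations.
        self.M.add((op, next(_ids)))
        self.deliver()

    def deliver(self) -> None:
        # Apply each pending message exactly once.
        for op, uid in sorted(self.M - self.D):
            self.P += 1 if op == "increment" else -1
            self.D.add((op, uid))

    def merge(self, other: "StateEmulation") -> None:
        self.M |= other.M
        self.deliver()


a, b = StateEmulation(), StateEmulation()
a.update("increment")
a.update("increment")
b.update("decrement")
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

After the mutual merge, both replicas hold the same M and D sets and the same counter value, and the repeated merge is absorbed because delivered messages are remembered in D.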


RR n° 7506















Shapiro, Preguiça, Baquero, Zawirski


3 Portfolio of basic CRDTs 

To show the usefulness of the CRDT concept, we now present a number of CRDT designs. 
They are interesting to understand the challenges, possibilities and limitations of CRDTs. 
They also constitute a library of types that can be re-used and combined to build distributed 
systems. We start with some simple types (counters and registers), then move on to collection 
types (sets), and finally some types with more complex requirements (graphs, DAGs and 
sequences). 

Our specifications are written with clarity in mind, not efficiency. In many cases, there 
are clearly equivalent approaches that conserve space, but we systematically preferred the 
more easily-understood version. 

We write either state- or op-based specifications, as convenient. For each state-based 
example, we have the obligation to prove that its states form a monotonic semilattice and 
that merge computes a LUB. For each op-based example, we must demonstrate that a 
delivery order exists and that concurrent updates commute. 

3.1 Counters 

A Counter is a replicated integer supporting operations increment and decrement to update 
it, and value to query it. The semantics is that the value converges towards the
global number of increments minus the number of decrements. (Extension to operations 
for adding and subtracting an argument is straightforward.) A Counter CRDT is useful 
in many peer-to-peer applications, for instance counting the number of currently logged-in 
users. 

In this section we discuss different designs for implementing a counter CRDT. Despite 
its simplicity, the Counter exposes some of the design issues of CRDTs. 

3.1.1 Op-based counter 

An op-based counter is presented in Specification 5. Its payload is an integer. Its empty 
atSource clause is omitted; the downstream phase just adds or subtracts locally. It is well- 
known that addition and subtraction commute, assuming no overflow. Therefore, this data 
type is a CmRDT. 
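
As an illustration, the Python sketch below applies the same multiset of operations to a fresh replica in every possible delivery order and collects the resulting values. The class `OpCounter` and the string encoding of operations are assumptions of this sketch; each operation is delivered exactly once, as reliable broadcast would guarantee.

```python
from itertools import permutations


class OpCounter:
    """Op-based counter: downstream just adds or subtracts locally."""

    def __init__(self) -> None:
        self.i = 0

    def value(self) -> int:
        return self.i

    def downstream(self, op: str) -> None:
        # No precondition: any delivery order is acceptable.
        if op == "increment":
            self.i += 1
        elif op == "decrement":
            self.i -= 1
        else:
            raise ValueError(op)


ops = ["increment", "increment", "decrement", "increment"]
results = set()
for order in permutations(ops):
    replica = OpCounter()
    for op in order:
        replica.downstream(op)  # each operation delivered exactly once
    results.add(replica.value())
```

All delivery orders produce the same value, three increments minus one decrement.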

3.1.2 State-based increment-only Counter (G-Counter) 

A state-based counter is not as straightforward as one would expect. To simplify the problem, 
we start with a Counter that only increments. 

Specification 6 State-based increment-only counter (vector version)

payload integer[n] P                 ▷ One entry per replica
    initial [0, 0, ..., 0]
update increment ()
    let g = myID()                   ▷ g: source replica
    P[g] := P[g] + 1
query value () : integer v
    let v = Σ_i P[i]
compare (X, Y) : boolean b
    let b = (∀i ∈ [0, n − 1] : X.P[i] ≤ Y.P[i])
merge (X, Y) : payload Z
    let ∀i ∈ [0, n − 1] : Z.P[i] = max(X.P[i], Y.P[i])

Specification 7 State-based PN-Counter

payload integer[n] P, integer[n] N   ▷ One entry per replica
    initial [0, 0, ..., 0], [0, 0, ..., 0]
update increment ()
    let g = myID()                   ▷ g: source replica
    P[g] := P[g] + 1
update decrement ()
    let g = myID()
    N[g] := N[g] + 1
query value () : integer v
    let v = Σ_i P[i] − Σ_i N[i]
compare (X, Y) : boolean b
    let b = (∀i ∈ [0, n − 1] : X.P[i] ≤ Y.P[i] ∧ ∀i ∈ [0, n − 1] : X.N[i] ≤ Y.N[i])
merge (X, Y) : payload Z
    let ∀i ∈ [0, n − 1] : Z.P[i] = max(X.P[i], Y.P[i])
    let ∀i ∈ [0, n − 1] : Z.N[i] = max(X.N[i], Y.N[i])

Suppose the payload was a single integer and merge computes max. This data type is
a CvRDT as its states form a monotonic semilattice. Consider two replicas, with the same
initial state of 0; at each one, a client originates increment. They converge to 1 instead of
the expected 2.

Suppose instead the payload is an integer and merge adds the two values. This is not a 
CvRDT, as merge is not idempotent. 

We propose instead the construct of Specification 6 (inspired by vector clocks). The
payload is a vector of integers; each source replica is assigned an entry. To increment, add 1
to the entry of the source replica. The value is the sum of all entries. We define the partial
order over two states $X$ and $Y$ by $X \leq Y \Leftrightarrow \forall i \in [0, n-1] : X.P[i] \leq Y.P[i]$, where $n$ is the
number of replicas. Merge takes the maximum of each entry. This data type is a CvRDT,
as its states form a monotonic semilattice, and merge produces the LUB.

This version makes two important assumptions: the payload does not overflow, and the 
set of replicas is well-known. Note however that the op-based version implicitly makes the 
same two assumptions. 
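
A direct Python transcription of the vector design might look as follows. The class name `GCounter` and the explicit `my_id` constructor argument (standing in for myID()) are assumptions of this sketch; the number of replicas `n` is fixed up front, matching the assumption that the set of replicas is well-known.

```python
class GCounter:
    """State-based increment-only counter with one entry per replica."""

    def __init__(self, n: int, my_id: int) -> None:
        self.P = [0] * n
        self.my_id = my_id

    def increment(self) -> None:
        # Each replica only ever advances its own entry.
        self.P[self.my_id] += 1

    def value(self) -> int:
        return sum(self.P)

    def compare(self, other: "GCounter") -> bool:
        return all(x <= y for x, y in zip(self.P, other.P))

    def merge(self, other: "GCounter") -> None:
        # Entry-wise maximum is the LUB under the entry-wise order.
        self.P = [max(x, y) for x, y in zip(self.P, other.P)]


r0, r1 = GCounter(2, 0), GCounter(2, 1)
r0.increment()
r1.increment()
r0.merge(r1)
r1.merge(r0)
r1.merge(r0)  # a repeated merge has no further effect
```

Unlike the single-integer design with max, the two concurrent increments are both counted, and unlike the design where merge adds, repeated merges are harmless.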

Alternatively, G-Set (described later, Section 3.3.1) can serve as an increment-only
counter. G-Set works even when the set of replicas is not known. 

The increment-only counter is useful, for instance to count the number of clicks on a
link in a P2P-replicated web page, or a P2P “I Like It/I Don’t Like It” poll, as is common 
in social networks. 


3.1.3 State-based PN-Counter 

It is not straightforward to support decrement with the previous representation, because 
this operation would violate monotonicity of the semilattice. Furthermore, since merge is a 
max operation, decrement would have no effect. 

Our solution, PN-Counter (Specification 7) basically combines two G-Counters. Its 
payload consists of two vectors: P to register increments, and N for decrements. Its value is
the difference between the two corresponding G-Counters, its partial order is the conjunction 
of the corresponding partial orders, and merge merges the two vectors. Proving that this is 
a CRDT is left to the reader. 

Such a counter might be useful, for instance, to count the number of users logged in to 
a P2P application such as Skype. To avoid excessively large vectors, only super-peers would 
replicate the counter. Due to asynchrony, the count may diverge temporarily from its true 
value, but it will eventually be exact. 
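
As an illustration of the PN-Counter payload and merge, here is a short Python sketch. The class name, the fixed replica count and the integer replica index are assumptions of the example, not part of Specification 7.

```python
class PNCounter:
    """State-based counter built from two increment-only vectors, P and N."""

    def __init__(self, n, replica_id):
        self.p = [0] * n    # increments originated at each replica
        self.n = [0] * n    # decrements originated at each replica
        self.g = replica_id

    def increment(self):
        self.p[self.g] += 1

    def decrement(self):
        self.n[self.g] += 1   # a decrement grows N, so the state stays monotonic

    def value(self):
        return sum(self.p) - sum(self.n)

    def compare(self, other):
        # conjunction of the two entry-wise partial orders
        return (all(x <= y for x, y in zip(self.p, other.p))
                and all(x <= y for x, y in zip(self.n, other.n)))

    def merge(self, other):
        z = PNCounter(len(self.p), self.g)
        z.p = [max(x, y) for x, y in zip(self.p, other.p)]
        z.n = [max(x, y) for x, y in zip(self.n, other.n)]
        return z


x, y = PNCounter(2, 0), PNCounter(2, 1)
x.increment()
x.increment()
y.decrement()
xy, yx = x.merge(y), y.merge(x)
print(xy.value(), yx.value())  # 1 1
```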

3.1.4 Non-negative Counter 

Some applications require a counter that is non-negative; for instance, to count the remaining 
credit of an avatar in a P2P game. 
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Specification 8 State-based Last-Writer-Wins Register (LWW-Register) 


payload X x, timestamp t                  ▷ X: some type
initial ⊥, 0
update assign (X w)
    x, t := w, now()                      ▷ Timestamp, consistent with causality
query value () : X w
    let w = x
compare (R, R′) : boolean b
    let b = (R.t ≤ R′.t)
merge (R, R′) : payload R″
    if R.t ≤ R′.t then R″.x, R″.t = R′.x, R′.t
    else R″.x, R″.t = R.x, R.t

However, this is quite difficult to do while preserving the CRDT properties; indeed, 
this is a global invariant, which cannot be evaluated based on local information only. For 
instance, it is not sufficient for each replica to refrain from decrementing when its local value 
is 0: two replicas at value 1 might still concurrently decrement, and the value 
converges to −1. 

One possible approach would be to maintain any value internally, but to externalize 
negative ones as 0. However this is flawed, since incrementing from an internal value of, say, 
−1, has no effect; this violates the semantics required in Section 3.1. 

A correct approach is to enforce a local invariant that implies the global invariant: e.g., 
require that a client may not originate more decrements than it originated increments (i.e., 
∀g : P[g] − N[g] ≥ 0). However, this may be too strong. 

Note that one of the Set constructs (described later, Section 3.3) might serve as a 
non-negative counter, using add to increment and remove to decrement. However this does not 
have the expected semantics: if two replicas concurrently remove the same element, the 
result is equivalent to a single decrement. 

Sadly, the remaining alternative is to synchronise. This might be needed only occasionally, e.g., 
by reserving in advance the right to originate a given number of decrements, as in escrow 
transactions [28]. 
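
The local-invariant approach mentioned above (a replica never originates more decrements than it originated increments) can be sketched as follows. This is only an illustration on top of a PN-Counter-like payload; the class name and the choice to return False for a refused decrement are assumptions of the example.

```python
class LocalNonNegativeCounter:
    """PN-style payload with the local invariant P[g] - N[g] >= 0 for the source g."""

    def __init__(self, n, replica_id):
        self.p, self.n, self.g = [0] * n, [0] * n, replica_id

    def increment(self):
        self.p[self.g] += 1

    def decrement(self):
        # refuse if this replica has no locally originated increment left to undo
        if self.p[self.g] - self.n[self.g] <= 0:
            return False
        self.n[self.g] += 1
        return True

    def value(self):
        return sum(self.p) - sum(self.n)

    def merge(self, other):
        # in-place entry-wise maximum of both vectors
        self.p = [max(x, y) for x, y in zip(self.p, other.p)]
        self.n = [max(x, y) for x, y in zip(self.n, other.n)]


r0, r1 = LocalNonNegativeCounter(2, 0), LocalNonNegativeCounter(2, 1)
r0.increment()
r1.merge(r0)              # both replicas now observe value 1
ok1 = r1.decrement()      # refused: replica 1 originated no increment
ok0 = r0.decrement()      # accepted
r0.merge(r1)
r1.merge(r0)
print(ok1, ok0, r0.value(), r1.value())  # False True 0 0
```

The refused decrement at replica 1 shows why the text calls this invariant possibly too strong: the global value was 1, yet replica 1 could not decrement it.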

3.2 Registers 

A register is a memory cell storing an opaque atom or object (noted type X hereafter). It 
supports assign to update its value, and value to query it. Non-concurrent assigns preserve 
sequential semantics: the later one overwrites the earlier one. Unless safeguards are taken, 
concurrent updates do not commute; two major approaches are that one takes precedence 
over the other (LWW-Register, Section 3.2.1), or that both are retained (MV-Register, 
Section 3.2.2). 
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Specification 9 Op-based LWW-Register 

payload X x, timestamp t                  ▷ X: some type
initial ⊥, 0
query value () : X w
    let w = x
update assign (X x′)
    atSource () t′
        let t′ = now()                    ▷ Timestamp
    downstream (x′, t′)                   ▷ No precond: delivery order is empty
        if t < t′ then x, t := x′, t′





Figure 7: Integer LWW Register (state-based). Payload is a pair (value, timestamp) 





Figure 8: LWW-Set (state-based). Payload is a pair (set, timestamp) 
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={/} *,=={2} 






Figure 9: MV-Register (state-based) 



Figure 10: MV-Register counter-example 


3.2.1 Last-Writer-Wins Register (LWW-Register) 

A Last-Writer-Wins Register (LWW-Register) creates a total order of assignments by 
associating a timestamp with each update. Timestamps are assumed unique, totally ordered, 
and consistent with causal order; i.e., if assignment 1 happened-before assignment 2, the 
former’s timestamp is less than the latter’s [24]. This may be implemented as a per-replica 
counter concatenated with a unique replica identifier, such as its MAC address [24]. 

The state-based LWW-Register is presented in Specification 8. The type of the value 
can be any (local) data type X. The value operation returns the current value. The assign 
operation updates the payload with the new assigned value, and generates a new timestamp. 
The monotonic semilattice orders two values by their associated timestamps; the merge procedure 
selects the value with the maximal timestamp. Clearly, this data type is a CvRDT. Figure 7 
illustrates an integer LWW-Register. 
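
A state-based LWW-Register along these lines might look as follows in Python. The timestamp encoding, a (counter, replica identifier) tuple compared lexicographically, is one possible reading of "a per-replica counter concatenated with a unique replica identifier"; it and the in-place merge are assumptions of this sketch.

```python
class LWWRegister:
    """State-based register; the assignment with the highest timestamp wins."""

    def __init__(self, replica_id):
        self.x = None            # stands for the initial value ⊥
        self.t = (0, "")         # initial timestamp (assumed encoding)
        self.replica_id = replica_id

    def assign(self, w):
        # now(): a counter above anything seen so far, tagged with the replica id
        self.t = (self.t[0] + 1, self.replica_id)
        self.x = w

    def value(self):
        return self.x

    def compare(self, other):
        return self.t <= other.t

    def merge(self, other):
        # keep whichever (value, timestamp) pair has the maximal timestamp
        if self.t <= other.t:
            self.x, self.t = other.x, other.t


r1, r2 = LWWRegister("A"), LWWRegister("B")
r1.assign(1)      # timestamp (1, "A")
r2.assign(2)      # timestamp (1, "B"), concurrent with the first
r1.merge(r2)
r2.merge(r1)
print(r1.value(), r2.value())  # 2 2
r1.assign(3)      # happens after both earlier assignments
r2.merge(r1)
print(r2.value())  # 3
```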

Specification 9 presents the op-based LWW-Register. Operation assign generates a new 
timestamp at the source. Downstream, the update takes effect only if the new timestamp is 
greater than the current one. Because of the way timestamps are generated, this preserves 
the sequential semantics; concurrent assignments commute since, whatever the order of 
execution, only the one with the highest timestamp takes effect. 
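
The commutativity argument can be checked mechanically on a small example. The sketch below applies the downstream rule of the op-based register to three assignments in every possible delivery order; the tuple timestamps and the sample operations are assumptions chosen for illustration.

```python
from itertools import permutations


def downstream(state, op):
    """Apply one assign downstream: it takes effect only if its timestamp is newer."""
    x, t = state
    x_new, t_new = op
    return (x_new, t_new) if t < t_new else (x, t)


# three assignments with unique (counter, replica) timestamps
ops = [("a", (1, "r1")), ("b", (1, "r2")), ("c", (2, "r1"))]
initial = (None, (0, ""))

finals = set()
for order in permutations(ops):
    state = initial
    for op in order:
        state = downstream(state, op)
    finals.add(state)

print(finals)  # {('c', (2, 'r1'))}
```

Every one of the six delivery orders ends in the same state, the assignment carrying the highest timestamp.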

LWW-Registers, first described by Thomas [20], are ubiquitous in distributed systems. 
For instance, in a replicated file system such as NFS, type X is a file (or even a block in a 
file). Many other uses are possible; for instance, in Figure 8, X is a set (LWW-Set). 
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Specification 10 State-based Multi-Value Register (MV-Register) 

payload set S                             ▷ set of (x, V) pairs; x ∈ X; V its version vector
initial {(⊥, [0, ..., 0])}
query incVV () : integer[n] V′
    let g = myID()
    let 𝒱 = {V | ∃x : (x, V) ∈ S}
    let V′ = [max_{V ∈ 𝒱}(V[j])]_{j ≠ g}
    let V′[g] = max_{V ∈ 𝒱}(V[g]) + 1
update assign (set R)                     ▷ set of elements of type X
    let V = incVV()
    S := R × {V}
query value () : set S′
    let S′ = S
compare (A, B) : boolean b
    let b = (∀(x, V) ∈ A, (x′, V′) ∈ B : V ≤ V′)
merge (A, B) : payload C
    let A′ = {(x, V) ∈ A | ∀(y, W) ∈ B : V ∥ W ∨ V ≥ W}
    let B′ = {(y, W) ∈ B | ∀(x, V) ∈ A : W ∥ V ∨ W ≥ V}
    let C = A′ ∪ B′


3.2.2 Multi-Value Register (MV-Register) 

An alternative semantics is to define a LUB operation that merges concurrent assignments, 
for instance taking their union, as in file systems such as Coda [19] or in Amazon’s shopping 
cart [10]. Clients can later reduce multiple values to a single one, by a new assignment. 
Alternatively, in Ficus [34] merge is an application-specific resolver procedure. 

To detect concurrency, a scalar timestamp (as above) is insufficient. Therefore the 
state-based payload is a set of (X, versionVector) pairs, as shown in Spec. 10, and illustrated in 
Figure 9 (the op-based specification is left as an exercise to the reader). A value operation 
returns a copy of the payload. As usual, assign overwrites; to this effect, it computes a 
version vector that dominates all the previous ones. Operation merge takes the union of 
every element in each input set that is not dominated by an element in the other input set. 
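
The following Python sketch illustrates the MV-Register payload, assign and merge. Version vectors are represented as tuples, values must be hashable, and the helper name not_dominated is hypothetical; the sketch follows Spec. 10 in spirit but is not a transcription of it.

```python
def not_dominated(v, w):
    """True if v is concurrent with w or v >= w, i.e. v is not strictly below w."""
    strictly_below = all(a <= b for a, b in zip(v, w)) and v != w
    return not strictly_below


class MVRegister:
    def __init__(self, n, replica_id):
        self.n, self.g = n, replica_id
        self.s = {(None, (0,) * n)}        # {(⊥, [0, ..., 0])}

    def _inc_vv(self):
        # entry-wise maximum over all stored vectors, then bump the local entry
        vs = [v for _, v in self.s]
        new = [max(v[j] for v in vs) for j in range(self.n)]
        new[self.g] += 1
        return tuple(new)

    def assign(self, values):
        v = self._inc_vv()
        self.s = {(x, v) for x in values}  # overwrite: the new vector dominates the old ones

    def value(self):
        return {x for x, _ in self.s}

    def merge(self, other):
        keep_a = {(x, v) for x, v in self.s
                  if all(not_dominated(v, w) for _, w in other.s)}
        keep_b = {(y, w) for y, w in other.s
                  if all(not_dominated(w, v) for _, v in self.s)}
        self.s = keep_a | keep_b


a, b = MVRegister(2, 0), MVRegister(2, 1)
a.assign({"x"})        # version vector (1, 0)
b.assign({"y"})        # version vector (0, 1), concurrent
a.merge(b)
both = a.value()       # both concurrent values are retained
a.assign({"z"})        # version vector (2, 1) dominates both
b.merge(a)
print(sorted(both), b.value())  # ['x', 'y'] {'z'}
```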

As noted in the Dynamo article [10], Amazon’s shopping cart presents an anomaly, 
whereby a removed book may re-appear. This is illustrated in the example of Figure 10. 
The problem is that MV-Register does not behave like a set, contrary to what one might 
expect since its payload is a set. We will present clean specifications of Sets in Section 3.3. 
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Figure 11: Counter-example: Set with concurrent add and remove (op-based) 

3.3 Sets 

Sets constitute one of the most basic data structures. Containers, Maps, and Graphs are all 
based on Sets. 

We consider mutating operations add (takes its union with an element) and remove 
(performs a set-minus). Unfortunately, these operations do not commute. Therefore, a Set 
cannot both be a CRDT and conform to the sequential specification of a set. 

To illustrate, consider the naive replicated op-based set of Figure 11. Operations add 
and remove are applied sequentially as they arrive. Initially, the set is empty. Replica 1 
adds element a, then removes a; its state is again empty. Replica 2 adds the same element 
a; when Replica 1 applies this operation, its (final) state becomes {a}. Replica 3 receives 
the two add operations; the second one has no effect since a is already in the set. Then 
it receives the remove, which makes its state empty. Both Replica 1 and Replica 3 have 
applied all operations in causal order, yet they diverge. 
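
The divergence in this scenario is easy to reproduce. The sketch below replays the three operations in the two delivery orders described above; the tuple encoding of operations is an assumption of the example.

```python
def apply(state, op):
    """Naive sequential semantics: add is a union, remove is a set-minus."""
    kind, e = op
    return state | {e} if kind == "add" else state - {e}


add1, rem1 = ("add", "a"), ("remove", "a")   # originated at Replica 1
add2 = ("add", "a")                          # originated at Replica 2

replica1 = set()
for op in (add1, rem1, add2):    # Replica 1: its own add and remove, then the remote add
    replica1 = apply(replica1, op)

replica3 = set()
for op in (add1, add2, rem1):    # Replica 3: both adds first, then the remove
    replica3 = apply(replica3, op)

print(replica1, replica3)  # {'a'} set()
```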

Thus, a CRDT can only approximate the sequential set. Hereafter, we will examine a few 
different approximations that differ mainly by the result of concurrent add(e) ∥d remove(e). 
The 2P-Set hereafter (Section 3.3.2) gives precedence to remove, OR-Set (Section 3.3.5) to 
add.⁴ 


3.3.1 Grow-Only Set (G-Set) 

The simplest solution is to avoid remove altogether. A Grow-Only Set (G-Set), illustrated 
in Figure 2, supports operations add and lookup only. The G-Set is useful as a building 
block for more complex constructions. 

In both the state- and op-based approaches, the payload is a set. Since add is based 
on union, and union is commutative, the op-based implementation converges; G-Set is a 
CmRDT. The precondition to add is true, therefore delivery order is empty (operations can 
execute in any order). 

4 Note that two clients may concurrently remove the same element. Despite a superficial similarity, our 
Sets are different from Tuple Spaces [7], in which removes are totally ordered. 

Specification 11 State-based grow-only Set (G-Set) 

payload set A
initial ∅
update add (element e)
    A := A ∪ {e}
query lookup (element e) : boolean b
    let b = (e ∈ A)
compare (S, T) : boolean b
    let b = (S.A ⊆ T.A)
merge (S, T) : payload U
    let U.A = S.A ∪ T.A


Specification 12 State-based 2P-Set 

payload set A, set R                      ▷ A: added; R: removed
initial ∅, ∅
query lookup (element e) : boolean b
    let b = (e ∈ A ∧ e ∉ R)
update add (element e)
    A := A ∪ {e}
update remove (element e)
    pre lookup(e)
    R := R ∪ {e}
compare (S, T) : boolean b
    let b = (S.A ⊆ T.A ∨ S.R ⊆ T.R)
merge (S, T) : payload U
    let U.A = S.A ∪ T.A
    let U.R = S.R ∪ T.R


In the state-based approach, add modifies the local state as shown in Specification 11. 
We define a partial order on some states S and T as S ≤ T ⇔ S ⊆ T and the merge 
operation as merge(S, T) = S ∪ T. Thus defined, states form a monotonic semilattice and 
merge is a LUB operation; G-Set is a CvRDT. 
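
A state-based G-Set is short enough to write out in full. The Python sketch below mirrors the operations described above; the class name is an assumption of the example.

```python
class GSet:
    """State-based grow-only set: the payload is a set, merge is set union."""

    def __init__(self):
        self.a = set()

    def add(self, e):
        self.a.add(e)

    def lookup(self, e):
        return e in self.a

    def compare(self, other):
        return self.a <= other.a          # subset test: S <= T iff S is a subset of T

    def merge(self, other):
        u = GSet()
        u.a = self.a | other.a            # least upper bound under set inclusion
        return u


s, t = GSet(), GSet()
s.add("x")
t.add("y")
u = s.merge(t)
print(sorted(u.a), s.compare(u), t.compare(u))  # ['x', 'y'] True True
```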

3.3.2 2P-Set 

Our second variant is a Set where an element may be added and removed, but never added 
again thereafter. This Two-Phase Set (2P-Set) is specified in Specification 12 and illustrated 
in Figure 3. It combines a G-Set for adding with another for removing; the latter is 
colloquially known as the tombstone set. To avoid anomalies, removing an element is allowed 
only if the source observes that the element is in the set. 
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Specification 13 U-Set: Op-based 2P-Set with unique elements 


payload set S
initial ∅
query lookup (element e) : boolean b
    let b = (e ∈ S)
update add (element e)
    atSource (e)
        pre e is unique
    downstream (e)
        S := S ∪ {e}
update remove (element e)
    atSource (e)
        pre lookup(e)                     ▷ 2P-Set precondition
    downstream (e)
        pre add(e) has been delivered     ▷ Causal order suffices
        S := S \ {e}



State-based 2P-Set The state-based variant is in Specification 12. The payload is 
composed of local set A for adding, and local set R for removing. The lookup operation checks 
that the element has been added but not yet removed. Adding or removing the same element 
twice has no effect, nor does adding an element that has already been removed. The merge 
procedure computes a LUB by taking the union of the individual added- and removed-sets. 
Therefore, this is indeed a CRDT. 

Note that a tombstone is required to ensure that, if the removal of an element is received 
by a downstream replica before the corresponding add, the effect of the remove still takes 
precedence. 
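
The state-based 2P-Set can be sketched as follows. The class name, the in-place merge and the use of ValueError for a failed precondition are assumptions of the example. The compare method uses the conjunction of the two subset tests, by analogy with the PN-Counter partial order; that choice is also an assumption of this sketch.

```python
class TwoPSet:
    """State-based two-phase set: an add-set A plus a tombstone set R."""

    def __init__(self):
        self.a, self.r = set(), set()

    def lookup(self, e):
        return e in self.a and e not in self.r

    def add(self, e):
        self.a.add(e)

    def remove(self, e):
        if not self.lookup(e):                 # precondition: pre lookup(e)
            raise ValueError("element is not in the set")
        self.r.add(e)                          # the tombstone is never discarded

    def compare(self, other):
        return self.a <= other.a and self.r <= other.r

    def merge(self, other):
        self.a |= other.a
        self.r |= other.r


x, y = TwoPSet(), TwoPSet()
x.add("e")
y.merge(x)
y.remove("e")
x.merge(y)
x.add("e")            # no effect: "e" already carries a tombstone
print(x.lookup("e"), y.lookup("e"))  # False False
```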


Op-based 2P-Set Consider now the op-based variant of 2P-Set. Concurrent adds of 
the same element commute, as do concurrent removes. Concurrent operations on different 
elements commute. Operation pairs on the same element add(e) ∥d add(e) and remove(e) ∥d 
remove(e) commute by definition; and remove(e) can occur only after add(e). It follows that 
this data type is indeed a CRDT. 


U-Set 2P-Set can be simplified under two standard assumptions, as in Specification 13. 
If elements are unique, a removed element will never be added again.⁵ If, furthermore, a 
downstream precondition ensures that add(e) is delivered before remove(e), there is no need 
to record removed elements, and the remove-set is redundant. (Causal delivery is sufficient 
to ensure this precondition.) Spec. 13 captures this data type, which we call U-Set. 

5 Function unique returns a unique value. It could be a Lamport clock, as in Section 3.2.1; alternatively, 
it might select a number randomly from a large space, ensuring uniqueness with high probability. 
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Figure 12: LWW-element-Set; elements masked by one with a higher timestamp are elided 
(state-based) 


If we assume (as seems to be the practice) that every element in a shopping cart is 
unique, then U-Set satisfies the intuitive properties requested of a shopping cart, without 
the Dynamo anomalies described in Section 3.2.2. 

U-Set is a CRDT. As every element is assumed unique, adds are independent. A 
remove operation must be causally after the corresponding add. Accordingly, there can be 
no concurrent add and remove of the same element. 
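
An op-based U-Set can be sketched as below. Elements are made unique by pairing the item with a value from a module-level counter, a hypothetical stand-in for the unique function; the caller is assumed to deliver operations in causal order, so the downstream precondition is not checked by the code.

```python
import itertools

_unique = itertools.count()     # hypothetical stand-in for unique()


class USet:
    """Op-based set with unique elements; no tombstones are kept."""

    def __init__(self):
        self.s = set()

    def lookup(self, e):
        return e in self.s

    def add_at_source(self, item):
        return ("add", (item, next(_unique)))     # the element is unique by construction

    def remove_at_source(self, e):
        assert self.lookup(e), "2P-Set precondition: element must be in the set"
        return ("remove", e)

    def downstream(self, op):
        # assumes causal delivery: add(e) always arrives before any remove(e)
        kind, e = op
        if kind == "add":
            self.s.add(e)
        else:
            self.s.discard(e)


r1, r2 = USet(), USet()
add_op = r1.add_at_source("book")
for r in (r1, r2):
    r.downstream(add_op)
e = add_op[1]
rm1, rm2 = r1.remove_at_source(e), r2.remove_at_source(e)   # concurrent removes
r1.downstream(rm1)
r1.downstream(rm2)
r2.downstream(rm2)
r2.downstream(rm1)
print(r1.s, r2.s)  # set() set()
```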

3.3.3 LWW-element-Set 

An alternative LWW-based approach,⁶ which we call LWW-element-Set (see Figure 12), 
attaches a timestamp to each element (rather than to the whole set, as in Figure 8). 
Consider add-set A and remove-set R, each containing (element, timestamp) pairs. To add 
(resp. remove) an element e, add the pair (e, now()), where now was specified earlier, to 
A (resp. to R). Merging two replicas takes the union of their add-sets and remove-sets. 
An element e is in the set if it is in A, and it is not in R with a higher timestamp: 
lookup(e) = ∃t, ∀t′ > t : (e, t) ∈ A ∧ (e, t′) ∉ R. Since it is based on LWW, this data 
type is convergent. 
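
The lookup rule can be illustrated with a short Python sketch. Timestamps are passed in explicitly as integers rather than obtained from now, which is a simplification made for the example.

```python
class LWWElementSet:
    """State-based set of (element, timestamp) pairs in an add-set and a remove-set."""

    def __init__(self):
        self.a, self.r = set(), set()

    def add(self, e, t):
        self.a.add((e, t))

    def remove(self, e, t):
        self.r.add((e, t))

    def lookup(self, e):
        adds = [t for x, t in self.a if x == e]
        removes = [t for x, t in self.r if x == e]
        # e is present if some add has a timestamp that no remove exceeds
        return any(all(tr <= t for tr in removes) for t in adds)

    def merge(self, other):
        self.a |= other.a
        self.r |= other.r


x, y = LWWElementSet(), LWWElementSet()
x.add("e", 1)
y.remove("e", 2)
x.merge(y)
after_remove = x.lookup("e")     # the remove has the higher timestamp
x.add("e", 3)
print(after_remove, x.lookup("e"))  # False True
```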


3.3.4 PN-Set 

Yet another variation is to associate a counter to each element, initially 0. Adding an element 
increments the associated counter, and removing an element decrements it. The element is 
considered in the set if its counter is strictly positive. An actual use-case is Logoot-Undo 
[43], a (totally-ordered) set of elements for text editing. 

However, as noted earlier (Section 3.1.3), a CRDT counter can go positive or negative; 
adding an element whose counter is already negative has no effect. Consider the following 
example, illustrated in Figure 13. Initially, our PN-Set is empty. Replica 1 performs add(e): 

6 Due to Hyun-Gul Roh [private communication]. 
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Specification 14 Molli, Weiss, Skaf Set 

payload set S = {(element, count), ...}   ▷ set of pairs
initial E × {0}                           ▷ Initialise all counts to 0
query lookup (element e) : boolean b
    let b = ((e, k) ∈ S ∧ k > 0)
update add (element e)
    atSource (e) : integer j              ▷ j: increment
        if ∃(e, k) ∈ S : k ≤ 0 then
            let j = |k| + 1
        else
            let j = 1
    downstream (e, j)
        let k′ : (e, k′) ∈ S
        S := S \ {(e, k′)} ∪ {(e, k′ + j)}
update remove (element e)
    atSource (e)
        pre lookup(e)
    downstream (e)
        let k′ : (e, k′) ∈ S
        S := S \ {(e, k′)} ∪ {(e, k′ − 1)}




Figure 13: PN-Set (op-based) 
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Specification 15 Op-based Observed-Remove Set (OR-Set) 

payload set S                             ▷ set of pairs {(element e, unique-tag u), ...}
initial ∅
query lookup (element e) : boolean b
    let b = (∃u : (e, u) ∈ S)
update add (element e)
    atSource (e)
        let α = unique()                  ▷ unique() returns a unique value
    downstream (e, α)
        S := S ∪ {(e, α)}
update remove (element e)
    atSource (e)
        pre lookup(e)
        let R = {(e, u) | ∃u : (e, u) ∈ S}
    downstream (R)
        pre ∀(e, u) ∈ R : add(e, u) has been delivered    ▷ U-Set precondition; causal order suffices
        S := S \ R                        ▷ Downstream: remove pairs observed at source


element e has a count of 1. The operation propagates to Replica 3. Now Replicas 1 and 3 
both concurrently execute remove(e); after Replica 3 applies both operations, e has a count 
of −1. A subsequent add(e) has no effect: thus, after adding an element to an empty “set” 
it remains empty! For some applications, this may be the intended semantics; for instance, 
in an inventory, a negative count may account for goods in transit. In others, this may be 
considered a bug. 

Although the semantics are strange, PN-Set converges; thus if Replica 2 concurrently 
executes add(e) all replicas converge to state {e}. 
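
The add anomaly can be replayed with a few lines of Python. The sketch models only the downstream effect of PN-Set operations on one replica; the class name and the tuple encoding of operations are assumptions of the example.

```python
from collections import defaultdict


class PNSet:
    """Op-based PN-Set: one signed counter per element."""

    def __init__(self):
        self.count = defaultdict(int)      # every count starts at 0

    def lookup(self, e):
        return self.count[e] > 0

    def downstream(self, op):
        kind, e = op
        self.count[e] += 1 if kind == "add" else -1


r3 = PNSet()
for op in [("add", "e"), ("remove", "e"), ("remove", "e")]:
    r3.downstream(op)                      # one add, then two concurrent removes
count_after_removes = r3.count["e"]
r3.downstream(("add", "e"))                # a subsequent add only brings the count to 0
print(count_after_removes, r3.count["e"], r3.lookup("e"))  # -1 0 False
```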

An alternative construction due to Molli, Weiss and Skaf [private communication] is 
presented in Specification 14. To avoid the above add anomaly, add increments a negative 
count of k by |k| + 1; however this presents other anomalies, for instance where remove has 
no effect. 

Both these constructs are CRDTs because they combine two CRDTs, a Set and a 
Counter. 


3.3.5 Observed-Remove Set (OR-Set) 

The preceding Set constructs have practical applications, but are somewhat counter-intuitive. 
In 2P-Set (Section 3.3.2), a removed element can never be added again; in LWW-Set 
(Figure 8) the outcome of concurrent updates depends on opaque details of how timestamps are 
allocated. 
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Figure 14: Observed-Remove Set (op-based) 


We present here the Observed-Remove Set (OR-Set), which supports adding and 
removing elements and is easily understandable. The outcome of a sequence of adds and 
removes depends only on its causal history and conforms to the sequential specification of a 
set. In the case of concurrent add and remove of the same element, add has precedence (in 
contrast to 2P-Set). 

The intuition is to tag each added element uniquely, without exposing the unique tags in 
the interface. When removing an element, all associated unique tags observed at the source 
replica are removed, and only those. 

Spec. 15 is op-based. The payload consists of a set of pairs (element, unique-identifier). 
A lookup(e) extracts element e from the pairs. Operation add(e) generates a unique identifier 
in the source replica, which is then propagated to downstream replicas, which insert the pair 
into their payload. Two add(e) generate two unique pairs, but lookup masks the duplicates. 

When a client calls remove(e) at some source, the set of unique tags associated with e at 
the source is recorded. Downstream, all such pairs are removed from the local payload. Thus, 
when remove(e) happens-after any number of add(e), all duplicate pairs are removed, and 
the element is not in the set any more, as expected intuitively. When add(e) is concurrent 
with remove(e), the add takes precedence, as the unique tag generated by add cannot be 
observed by remove. 

This behaviour is illustrated in Figure 14. The two add(a) operations generate unique 
tags α and β. The remove(a) called at the top replica translates to removing (a, α) 
downstream. The add called at the second replica is concurrent to the remove of the first one, 
therefore (a, β) remains in the final state. 

OR-Set is a CRDT. Concurrent adds commute since each one is unique. Concurrent 
removes commute because any common pairs have the same effect, and any disjoint pairs 
have independent effects. Concurrent add(e) and remove(f) also commute: if e ≠ f they 
are independent, and if e = f the remove has no effect. 
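
The scenario of two concurrent adds and one remove can be replayed with the following Python sketch of an op-based OR-Set. Tag generation from a replica name plus a local counter is a hypothetical stand-in for unique(), and the caller is assumed to deliver operations in causal order.

```python
import itertools


class ORSet:
    """Op-based observed-remove set: the payload is a set of (element, tag) pairs."""

    def __init__(self, replica_id):
        self.s = set()
        # hypothetical unique(): replica name plus a local sequence number
        self._tags = (f"{replica_id}-{i}" for i in itertools.count())

    def lookup(self, e):
        return any(x == e for x, _ in self.s)

    def add_at_source(self, e):
        return ("add", {(e, next(self._tags))})

    def remove_at_source(self, e):
        assert self.lookup(e), "precondition: element must be in the set"
        return ("remove", {(x, u) for x, u in self.s if x == e})   # pairs observed here

    def downstream(self, op):
        kind, pairs = op
        if kind == "add":
            self.s |= pairs
        else:
            self.s -= pairs         # removes only the pairs observed at the source


r1, r2 = ORSet("r1"), ORSet("r2")
add1 = r1.add_at_source("a")
r1.downstream(add1)
add2 = r2.add_at_source("a")        # concurrent with add1 and with the remove below
r2.downstream(add2)
rem1 = r1.remove_at_source("a")     # observes only the tag created by add1
r1.downstream(rem1)
r1.downstream(add2)
r2.downstream(add1)
r2.downstream(rem1)
print(r1.s == r2.s, r1.lookup("a"))  # True True
```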

We leave the corresponding state-based specification as an exercise for the reader. Since 
every add is effectively unique, a state-based implementation could be based on U-Set. 
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Figure 15: Maintaining strong properties in a graph (counter-example). Left: initial state 

and update (dashed edges removed, dotted edges added); right: final state. 

Specification 16 2P2P-Graph (op-based) 

payload set VA, VR, EA, ER                ▷ V: vertices; E: edges; A: added; R: removed
initial ∅, ∅, ∅, ∅
query lookup (vertex v) : boolean b
    let b = (v ∈ (VA \ VR))
query lookup (edge (u, v)) : boolean b
    let b = (lookup(u) ∧ lookup(v) ∧ (u, v) ∈ (EA \ ER))
update addVertex (vertex w)
    atSource (w)
    downstream (w)
        VA := VA ∪ {w}
update addEdge (vertex u, vertex v)
    atSource (u, v)
        pre lookup(u) ∧ lookup(v)         ▷ Graph precondition: E ⊆ V × V
    downstream (u, v)
        EA := EA ∪ {(u, v)}
update removeVertex (vertex w)
    atSource (w)
        pre lookup(w)                     ▷ 2P-Set precondition
        pre ∀(u, v) ∈ (EA \ ER) : u ≠ w ∧ v ≠ w    ▷ Graph precondition: E ⊆ V × V
    downstream (w)
        pre addVertex(w) delivered        ▷ 2P-Set precondition
        VR := VR ∪ {w}
update removeEdge (edge (u, v))
    atSource ((u, v))
        pre lookup((u, v))                ▷ 2P-Set precondition
    downstream (u, v)
        pre addEdge(u, v) delivered       ▷ 2P-Set precondition
        ER := ER ∪ {(u, v)}
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Figure 16: Monotonic DAG. Left: Greek letters indicate vertex identifiers; roman letters are 
characters in a text-editing application. Right: Remove OK only if paths are maintained. 
Dashed: removed; dotted: added. 

3.4 Graphs 

A graph is a pair of sets (V, E) (called vertices and edges respectively) such that E ⊆ V × V. 
Any of the Set implementations described above can be used for V and E. 

Because of the invariant E ⊆ V × V, operations on vertices and edges are not 
independent. An edge may be added only if the corresponding vertices exist; conversely, 
a vertex may be removed only if it supports no edge. What should happen upon 
concurrent addEdge(u, v) ∥d removeVertex(u)? We see three possibilities: (i) Give 
precedence to removeVertex(u): all edges to or from u are removed as a side effect. This 
is easy to implement, by using tombstones for removed vertices. (ii) Give precedence to 
addEdge(u, v): if either u or v has been removed, it is restored. This semantics is more 
complex. (iii) removeVertex(u) is delayed until all concurrent addEdge operations have 
executed. This requires synchronisation. Therefore, we choose Option (i). Our Spec. 16 uses 
a 2P-Set for vertices (in order to have tombstones) and another for edges (since they are not 
unique). 

A 2P2P-Graph is the combination of two 2P-Sets; as we showed, the dependencies 
between them are resolved by causal delivery. Dependencies between addEdge and removeEdge, 
and between addVertex and removeVertex are resolved as in 2P-Set. Therefore, this 
construct is a CRDT. 
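
Option (i) can be illustrated by a sketch that models only the downstream effects of a 2P2P-Graph. Source preconditions are assumed to have been checked where each operation originated, and the class and method names are assumptions of the example.

```python
class TwoPTwoPGraph:
    """Downstream state of a 2P2P-Graph: four grow-only sets."""

    def __init__(self):
        self.va, self.vr, self.ea, self.er = set(), set(), set(), set()

    def has_vertex(self, v):
        return v in self.va and v not in self.vr

    def has_edge(self, u, v):
        # an edge is visible only while both of its endpoints are visible
        return (self.has_vertex(u) and self.has_vertex(v)
                and (u, v) in self.ea and (u, v) not in self.er)

    def downstream(self, op):
        kind, arg = op
        target = {"addVertex": self.va, "removeVertex": self.vr,
                  "addEdge": self.ea, "removeEdge": self.er}[kind]
        target.add(arg)


r1, r2 = TwoPTwoPGraph(), TwoPTwoPGraph()
for r in (r1, r2):
    r.downstream(("addVertex", "u"))
    r.downstream(("addVertex", "v"))

add_edge = ("addEdge", ("u", "v"))       # originated at r1, where u and v exist
remove_u = ("removeVertex", "u")         # originated concurrently at r2, where u has no edge
r1.downstream(add_edge)
r1.downstream(remove_u)
r2.downstream(remove_u)
r2.downstream(add_edge)
print(r1.has_edge("u", "v"), r2.has_edge("u", "v"), r1.has_vertex("u"))  # False False False
```

Both replicas end with the same four sets, and the tombstone for u hides the concurrently added edge, which is the precedence given to removeVertex.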
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Specification 17 Add-only Monotonic DAG (op-based) 


payload set V, set E                      ▷ V: vertices; E: edges
initial {⊢, ⊣}, {(⊢, ⊣)}                  ▷ Initialised with two sentinels and single edge.
query lookup (vertex v) : boolean b
    let b = (v ∈ V)
query lookup (edge (u, v)) : boolean b
    let b = ((u, v) ∈ E)
query path (edge (u, v)) : boolean b
    let b = (∃w1, ..., wm ∈ V : w1 = u ∧ wm = v ∧ (∀j : (wj, wj+1) ∈ E))
update addEdge (vertex u, vertex v)
    atSource (u, v)
        pre lookup(u) ∧ lookup(v)         ▷ Graph precondition
        pre path(u, v)                    ▷ Monotonic-DAG condition
    downstream (u, v)
        pre lookup(u) ∧ lookup(v)         ▷ Graph precondition
        E := E ∪ {(u, v)}
update addBetween (vertex u, v, w)
    atSource (u, v, w)
        pre v is unique
        pre lookup(u) ∧ lookup(w)         ▷ Graph precondition
        pre path(u, w)                    ▷ Monotonic-DAG condition
    downstream (u, v, w)
        pre lookup(u) ∧ lookup(w)         ▷ Graph precondition
        V := V ∪ {v}
        E := E ∪ {(u, v), (v, w)}




Figure 17: Monotonic DAG: remove is not live. Dashed: removed; dotted: added. 
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3.4.1 Add-only monotonic DAG 

In general, maintaining a particular shape, such as a tree or a DAG, cannot be done by 
a CRDT.⁷ Such a global invariant cannot be determined locally; maintaining it requires 
synchronisation. 

Figure 15 presents two counter-examples. Replicated graph u, v contains no edge. A 
client adds edge (u, v) at Replica 1; concurrently another client adds (v, u) at Replica 2. 
Each of these maintains the DAG shape, but when the changes at Replica 2 propagate to 
Replica 1, the graph is cyclic. Similarly, initially the graph w, x, y, z, t forms a replicated tree. 
Clients at Replicas 1 and 2 add and remove edges as indicated in the figure, maintaining 
the tree shape. However, after propagation, the graph is cyclic. 

However, some stronger forms of acyclicity are implied by local properties, for instance 
a monotonic DAG, in which an edge may be added only if it is oriented in the same direction 
as an existing path.⁸ That is, the new edge can only strengthen the partial order defined by 
the DAG; it follows that the graph remains acyclic. Specification 17 specifies an Add-Only 
Monotonic DAG, illustrated in Figure 16 (left). The DAG is initialised with left and right 
sentinels ⊢ and ⊣ and edge (⊢, ⊣). The only operation for adding a vertex is addBetween in 
order to maintain the DAG property. The first operation must be an addBetween between the 
sentinels ⊢ and ⊣. 

Add-only Monotonic DAG is a CRDT, because concurrent addEdge (resp. addBetween) 
either concern different edges (resp. vertices) in which case they are independent, or the 
same edge (resp. vertex), in which case the execution is idempotent. 
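
The monotonic-DAG condition can be illustrated with a single-replica Python sketch. It collapses atSource and downstream into one local call, names the sentinels with the hypothetical strings "|-" and "-|", and treats path as reachability over at least one edge; all three are simplifications made for the example.

```python
class MonotonicDAG:
    LEFT, RIGHT = "|-", "-|"        # hypothetical names for the two sentinels

    def __init__(self):
        self.v = {self.LEFT, self.RIGHT}
        self.e = {(self.LEFT, self.RIGHT)}

    def path(self, u, v):
        """True if v is reachable from u by following one or more edges."""
        stack, seen = [u], set()
        while stack:
            x = stack.pop()
            for a, b in self.e:
                if a == x and b not in seen:
                    if b == v:
                        return True
                    seen.add(b)
                    stack.append(b)
        return False

    def add_edge(self, u, v):
        # graph precondition plus monotonic-DAG condition
        if not (u in self.v and v in self.v and self.path(u, v)):
            raise ValueError("edge must follow an existing path")
        self.e.add((u, v))

    def add_between(self, u, v, w):
        if v in self.v or not (u in self.v and w in self.v and self.path(u, w)):
            raise ValueError("v must be new and a path from u to w must exist")
        self.v.add(v)
        self.e |= {(u, v), (v, w)}


g = MonotonicDAG()
g.add_between(g.LEFT, "a", g.RIGHT)
g.add_between("a", "b", g.RIGHT)
g.add_edge(g.LEFT, "b")             # allowed: the path LEFT -> a -> b already exists
try:
    g.add_edge("b", "a")            # rejected: there is no path from b to a
    rejected = False
except ValueError:
    rejected = True
print(rejected, g.path("b", "a"))  # True False
```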

Generalising monotonic DAG to removals proves problematic. It should be OK to remove 
an edge (expressed as a precondition on removeEdge) as long as this does not disrupt paths 
between distinct vertices. Namely, if there exists a path from u to v, and w ≠ u, v, then a 
path should remain after removing (x, w) or (w, x), whatever x ∈ V. A client could satisfy 
it by creating an alternative path if necessary, e.g., by calling addEdge(u, v) before removing 
(u, w), as illustrated in Figure 16 (right). 

Unfortunately, this is not live, as illustrated by the scenario of Figure 17. Here, a 
client adds a vertex around w, removes the edges to and from w, and finally removes w. 
Concurrently, another client (at another source replica) does the same with x. When the 
former operations propagate, the downstream precondition of addEdge is false at Replica 2, 
and, consequently the downstream precondition of removeVertex can never be satisfied; and 
vice-versa. 

3.4.2 Add-Remove Partial Order data type 

The above issues with vertex removal do not occur if we consider a Partial Order data type 
rather than a DAG. Since a partial order is transitive, implicitly all alternate paths exist; 

7 Unless of course the graph has the required shape for some other reason. For instance, a 2P2P-Graph 
could record causal dependence between events in a distributed system, which is acyclic. 

8 It is inspired by WOOT, a CRDT for concurrent editing [30]. 
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Specification 18 Add-Remove Partial Order 

payload set VA, VR, E                     ▷ V: vertices; E: edges; A: added, R: removed
initial {⊢, ⊣}, ∅, {(⊢, ⊣)}               ▷ Edge between left and right sentinels
query lookup (vertex v) : boolean b
    let b = (v ∈ VA \ VR)
query before (vertex u, v) : boolean b
    pre lookup(u) ∧ lookup(v)
    let b = (∃w1, ..., wm ∈ VA : w1 = u ∧ wm = v ∧ (∀j : (wj, wj+1) ∈ E))
                                          ▷ Removed vertices are considered too
update addBetween (vertex u, v, w)
    atSource (u, v, w)
        pre v is unique
        pre before(u, w)                  ▷ Monotonic-DAG precondition
    downstream (u, v, w)
        pre u ∈ VA ∧ w ∈ VA
        VA := VA ∪ {v}
        E := E ∪ {(u, v), (v, w)}
update remove (vertex v)
    atSource (v)
        pre lookup(v)                     ▷ 2P-Set precondition
        pre v ≠ ⊢ ∧ v ≠ ⊣                 ▷ May not remove sentinels
    downstream (v)
        VR := VR ∪ {v}
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Figure 18: Replicated Growable Array (RGA) 


thus the problematic precondition on vertex removal is not necessary. For the representation, 
we use a minimal DAG and compute transitive relations on the fly (operation before). To 
ensure transitivity, a removed vertex is retained as a tombstone.⁹ Thus, Spec. 18 uses a 
2P-Set for vertices, and a G-Set for edges. 

We manage vertices as a 2P-Set. Concurrent addBetweens are either independent or 
idempotent. Any dependence between addBetween and remove is resolved by causal delivery. 
Thus this data type is a CRDT. 
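
The role of tombstones in keeping the order transitive can be seen in a small single-replica sketch. As before, atSource and downstream are collapsed into one call and the sentinel names "|-" and "-|" are hypothetical; the sketch is an illustration, not a transcription of Spec. 18.

```python
class PartialOrder:
    """Add-remove partial order: 2P-Set of vertices, grow-only set of edges."""

    def __init__(self):
        self.va = {"|-", "-|"}
        self.vr = set()
        self.e = {("|-", "-|")}

    def lookup(self, v):
        return v in self.va and v not in self.vr

    def before(self, u, v):
        assert self.lookup(u) and self.lookup(v)
        # reachability over E; paths may run through removed (tombstoned) vertices
        frontier, seen = [u], set()
        while frontier:
            x = frontier.pop()
            for a, b in self.e:
                if a == x and b not in seen:
                    if b == v:
                        return True
                    seen.add(b)
                    frontier.append(b)
        return False

    def add_between(self, u, v, w):
        assert v not in self.va and self.before(u, w)
        self.va.add(v)
        self.e |= {(u, v), (v, w)}

    def remove(self, v):
        assert self.lookup(v) and v not in ("|-", "-|")
        self.vr.add(v)              # tombstone: v stays in VA and its edges stay in E


po = PartialOrder()
po.add_between("|-", "a", "-|")
po.add_between("a", "b", "-|")
po.remove("a")
print(po.lookup("a"), po.before("|-", "b"))  # False True
```

Although a is no longer visible, the order between the left sentinel and b is still derived through it.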

3.5 Co-operative text editing 

Peer-to-peer co-operative text editing is a particularly interesting use case of an add-remove 
order. A text document is a sequence of text elements (characters, strings, XML tags, 
embedded graphics, etc.). Users sharing a document repeatedly insert a text element (addBetween) 
or remove one (remove). Using a CRDT for this ensures that concurrent edits never conflict 
and converge, even for users who remain disconnected from the network for long periods, as 
long as they eventually reconnect. Thus, the WOOT data structure for concurrent editing 
corresponds directly to the Add-Remove Partial Order of Specification 18. 

A Partial Order presents a difficulty, as text is normally sequential, but two concurrent 
inserts at the same position remain unordered. A total order, or sequence, does not have 
this drawback, and in addition can be implemented much more efficiently. 

A sequence for text editing (or just sequence hereafter) is a totally-ordered set of 
elements, each composed of a unique identifier and an atom (e.g., a character, a string, an XML 
tag, or an embedded graphic), supporting operations to add an element at some position, 
and to remove an element.¹⁰ We now study two different sequence designs. Such a sequence 
is a CRDT because it is a subclass of add-remove total order. 

9 We do not include operations addEdge or removeEdge because it is not clear what semantics would be 
reasonable. 

10 Note that despite the superficial similarity, a sequence cannot implement a queue or stack, as the latter 
support atomic pop operations. 
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Specification 19 Replicated Growable Array (RGA). Represented as a 2P-Set of vertices in a 
linked list. A vertex is a pair (atom, timestamp). Timestamps are unique, positive, and increase consistently 

with causality. 


payload set VA, VR, E                     ▷ VA, VR: 2P-set of vertices; E: edges
                                          ▷ Vertex = (atom, timestamp)
let ⊢ = (⊥, −1)
let ⊣ = (⊥, 0)
initial {⊢, ⊣}, ∅, {(⊢, ⊣)}               ▷ Initially, a single edge (⊢, ⊣)
query lookup (vertex v) : boolean b
    let b = (v ∈ VA \ VR)
query before (vertex u, vertex v) : boolean b
    pre lookup(u) ∧ lookup(v)
    let b = (∃w1, ..., wm ∈ VA : w1 = u ∧ wm = v ∧ ∀j : (wj, wj+1) ∈ E)
query successor (vertex u) : vertex v
    pre lookup(u)
    let v ∈ VA : (u, v) ∈ E
query decompose (vertex u) : atom a, timestamp t
    let a, t : u = (a, t)                 ▷ Decompose u into atom, timestamp
update addRight (vertex u, atom a) : vertex w
    atSource (u, a) : w
        pre u ∈ VA \ (VR ∪ {⊣})           ▷ Graph precondition
        let t = now()                     ▷ Unique timestamp
        let w = (a, t)
    downstream (u, w)
        pre u ∈ VA                        ▷ Graph precondition
        let a, t = decompose(w)
        l, r := u, successor(u)
        b := true
        while b do                        ▷ Find an edge (l, r) within which to splice w
            let a′, t′ = decompose(r)
            if t < t′ then                ▷ Right position, wrong order
                l, r := r, successor(r)   ▷ Iterate
            else                          ▷ r = ⊣ ∨ t > t′
                E := E \ {(l, r)} ∪ {(l, w), (w, r)}
                b := false
update remove (vertex w)
    atSource (w)
        pre lookup(w)                     ▷ 2P-Set precondition
    downstream (w)
        pre addRight(_, w) delivered      ▷ 2P-Set precondition
        VR := VR ∪ {w}
3.5.1 Replicated Growable Array (RGA)

The Replicated Growable Array (RGA), due to Roh et al. [35], implements a sequence as a
linked list (a linear graph), as illustrated in Figure 18. It supports the operation addRight(v, a),
which adds an element containing atom a immediately after element v. An element’s identifier is
a timestamp, assumed unique and ordered consistently with causality, i.e., if two calls to now
return t and t', and the former happened-before the latter, then t < t' [24]. If a client
inserts twice at the same position, as in “addRight(v, a); addRight(v, b)”, the latter insert
occurs to the left of the former, and has a higher timestamp. Accordingly, two downstream
inserts at the same position are ordered in opposite order of their timestamps. As in Add-Remove
Partial Order, removing a vertex leaves a tombstone, in order to accommodate a
concurrent add operation.

For example, in Figure 18, timestamps are represented as a pair (local-clock, client-UID).
Client 3 added character I at time 30, then R at time 31, to the right of N. Clients 2 and 3
concurrently (at time 40) inserted an L and an apostrophe to the right of the beginning-of-text
marker ⊢.

As noted above, RGA is a CRDT because it is a subclass of Add-Remove Partial Order. 
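
The downstream splice of Spec. 19 can be sketched in Python as follows. This is only an illustration under assumptions of ours, not the implementation of Roh et al.: the class and method names (RGA, add_right, apply_add, apply_remove, atoms) are hypothetical, timestamps are (Lamport clock, client id) pairs standing in for now(), the edge set E is stored as a successor map, and a delivered vertex is explicitly recorded in VA. Exactly-once causal delivery is assumed to be provided by the channel.

```python
class RGA:
    """Op-based Replicated Growable Array, loosely following Spec. 19."""

    START = (None, (-1, 0))  # beginning-of-text sentinel
    END = (None, (0, 0))     # end-of-text sentinel; below any real timestamp

    def __init__(self, client_id):
        self.client_id = client_id
        self.clock = 0                        # Lamport clock backing now() (assumption)
        self.added = {self.START, self.END}   # VA
        self.removed = set()                  # VR: tombstones
        self.succ = {self.START: self.END}    # E, stored as a successor map

    def add_right(self, u, atom):
        """atSource: create a vertex right of u, apply it locally, return the op."""
        assert u in self.added and u not in self.removed and u != self.END
        self.clock += 1
        w = (atom, (self.clock, self.client_id))  # unique timestamp
        self.apply_add(u, w)
        return (u, w)

    def apply_add(self, u, w):
        """downstream: splice w after u, skipping vertices with higher timestamps."""
        assert u in self.added
        t = w[1]
        self.clock = max(self.clock, t[0])    # keep timestamps consistent with causality
        l, r = u, self.succ[u]
        while t < r[1]:                       # right position, wrong order: iterate
            l, r = r, self.succ[r]
        self.succ[l], self.succ[w] = w, r     # replace edge (l, r) by (l, w) and (w, r)
        self.added.add(w)                     # record w in VA (our addition)

    def remove(self, w):
        """atSource check plus local downstream; the op to broadcast is w."""
        assert w in self.added and w not in self.removed
        self.apply_remove(w)
        return w

    def apply_remove(self, w):
        assert w in self.added                # addRight(_, w) already delivered
        self.removed.add(w)                   # leave a tombstone

    def atoms(self):
        """Visible atoms in sequence order, skipping tombstones."""
        out, v = [], self.succ[self.START]
        while v != self.END:
            if v not in self.removed:
                out.append(v[0])
            v = self.succ[v]
        return out
```

With two replicas that each call add_right(RGA.START, ...) concurrently and then exchange the resulting operations through apply_add, both traversals visit the vertex with the higher timestamp first, so atoms() returns the same list at both replicas whatever the delivery order.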

3.5.2 Continuous sequence 

An alternative approach to maintaining a mutable sequence is to place its elements in the
continuum. Spec. 20 specifies a sequence based on identifying elements in a dense identifier
space such as ℝ, i.e., where a unique identifier can always be allocated between any two given
identifiers. Adding an element assigns it an appropriate identifier; identifiers are unique and
totally ordered (and unrelated by causality).

As noted above, this data structure is a CRDT because it is a subclass of Add-Remove 
Partial Order. More directly, concurrent adds commute because they occur at different 
positions in the continuum. Adding and deleting different elements commute because they 
are independent operations. Adding an element precedes removing it, and they will be 
applied downstream in that order, by the U-Set assumption of causal delivery. 
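
To make the idea concrete, here is a minimal Python sketch of the continuum sequence under assumptions of ours. Identifiers are pairs (rational position, site id), with Python's fractions.Fraction standing in for the dense space; the sentinel elements BEGIN and END, the class name and the method names are hypothetical. The sketch gives up (raises an error) when asked to allocate between two identifiers that share the same position, a case it does not attempt to handle.

```python
from fractions import Fraction

# Hypothetical sentinel elements delimiting the sequence (not part of Spec. 20).
BEGIN = ("", (Fraction(0), 0))
END = ("", (Fraction(1), 0))


class ContinuumSequence:
    def __init__(self, site):
        self.site = site
        self.elements = {BEGIN, END}   # payload S: (atom, identifier) pairs

    def allocate_identifier_between(self, i, j):
        """Return an identifier k with i < k < j, unique thanks to the site id."""
        assert i < j
        if i[0] == j[0]:
            raise ValueError("no rational position left between these identifiers")
        return ((i[0] + j[0]) / 2, self.site)

    def add_between(self, e, atom, e2):
        """atSource: allocate an identifier, apply locally, return the new element."""
        assert e in self.elements and e2 in self.elements and e[1] < e2[1]
        f = (atom, self.allocate_identifier_between(e[1], e2[1]))
        self.apply_add(f)
        return f

    def apply_add(self, f):
        """downstream: plain set insertion, as in a U-Set."""
        self.elements.add(f)

    def remove(self, e):
        assert e in self.elements
        self.apply_remove(e)
        return e

    def apply_remove(self, e):
        """downstream: assumes the matching add was delivered first (causal delivery)."""
        self.elements.discard(e)

    def atoms(self):
        """Atoms ordered by identifier, sentinels excluded."""
        ordered = sorted(self.elements, key=lambda e: e[1])
        return [e[0] for e in ordered if e not in (BEGIN, END)]
```

Repeated insertion at the same place halves the gap each time, so the numerators and denominators of the positions keep growing; this is one way to see why a naive numeric encoding is costly.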

The performance of this data structure depends crucially on the implementation of identifiers
and of allocateIdentifierBetween. Using real numbers would certainly be possible but costly.


Identifier tree. Instead, we represent the continuum using a tree. The first element is
allocated at the root. Thereafter, it is always possible to create a new leaf e between any
two nodes n and m, either to the right of n or to the left of m. To allocate a node e to the
right of a node n:

(i) If n has a right sibling m' < m and there exists a free unique tag m'' such that
n < m'' < m', allocate e as m''.¹¹

11 As tags are integers, there is not an infinite supply of unique free tags between two given tags. 
Specification 20 Mutable sequence based on the continuum

payload set S                                ▷ U-Set of (X, identifier) pairs; X: some type
  initial ∅
query lookup (element e) : boolean b
  let b = (e ∈ S)
query decompose (element e) : X x, identifier i
  let x, i : e = (x, i)
query before (element e, element e') : boolean b
  pre lookup(e) ∧ lookup(e')
  let x, i = decompose(e)
  let x', i' = decompose(e')
  let b = (i < i')
query allocateIdentifierBetween (identifier i, j) : identifier k
  pre i < j
  let k : i < k < j and k unique
update addBetween (element e, X b, element e') : element f
  atSource (e, b, e') : f
    pre lookup(e) ∧ lookup(e')
    pre before(e, e')
    let x, i = decompose(e)
    let x', i' = decompose(e')
    let f = (b, allocateIdentifierBetween(i, i'))
  downstream (f)
    S := S ∪ {f}
update remove (element e)
  atSource (e)
    pre e ∈ S                                ▷ U-Set precondition
  downstream (e)
    pre add(e) delivered                     ▷ U-Set precondition
    S := S \ {e}

(ii) Otherwise, if n has no right child, allocate e as the right child of n.

(iii) Otherwise, let n' be the leftmost descendant of n’s right child; clearly, n < n'.
Recursively, allocate e to the left of n'.

Allocating to the left of m is symmetric, substituting left for right and vice-versa.


Identifiers. A node identifier is a (possibly empty) sequence of pairs
$(d_1, u_1) \bullet \ldots \bullet (d_m, u_m)$, one per level in the tree. At each level, $d_j$
indicates the direction (0 for left child, 1 for right child), and $u_j$ is a unique integer tag.

The root node has the empty identifier. A child of some node $n$ has identifier
$m = n \bullet (d, u)$. Siblings are ordered by their relative identifiers; thus siblings
$m = n \bullet (d, u)$ and $m' = n \bullet (d', u')$ compare as
$m < m' \iff d < d' \lor (d = d' \land u < u')$. As the tree is traversed
in in-order, a parent $n$ is greater than its left children and less than its right children; i.e.,
$n$ compares with its child $m = n \bullet (d, u)$ thus: $n < m \iff d = 1$.

In summary, two identifiers $n$ and $n'$ compare as follows. Let $j \geq 0$ be the length of
their longest common prefix:
$n = (d_1, u_1) \bullet \ldots \bullet (d_j, u_j) \bullet (d_{j+1}, u_{j+1}) \bullet \ldots \bullet (d_{j+k}, u_{j+k})$
and
$n' = (d_1, u_1) \bullet \ldots \bullet (d_j, u_j) \bullet (d'_{j+1}, u'_{j+1}) \bullet \ldots \bullet (d'_{j+k'}, u'_{j+k'})$.
Then:

(i) If $k = 0$ and $k' = 0$, the two identifiers are identical.

(ii) If $k = 0$ and $k' > 0$, then $n'$ is a descendant of $n$. It is a right descendant iff
$d'_{j+1} = 1$, i.e., $n < n' \iff d'_{j+1} = 1$.

(iii) Symmetrically, if $k > 0$ and $k' = 0$ then $n < n' \iff d_{j+1} = 0$.

(iv) If $k > 0$ and $k' > 0$, then either $n$ and $n'$ are siblings, or they descend from siblings.
In both cases, they are ordered by the siblings’ relative identifiers:
$n < n' \iff d_{j+1} < d'_{j+1} \lor (d_{j+1} = d'_{j+1} \land u_{j+1} < u'_{j+1})$.
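
The four comparison rules translate directly into code. The Python sketch below assumes that an identifier is represented as a tuple of (d, u) pairs, with the empty tuple for the root; the function names less and compare are ours.

```python
from functools import cmp_to_key


def less(n, m):
    """Return True iff identifier n sorts strictly before identifier m."""
    j = 0
    while j < len(n) and j < len(m) and n[j] == m[j]:
        j += 1                    # j = length of the longest common prefix
    k, k2 = len(n) - j, len(m) - j
    if k == 0 and k2 == 0:
        return False              # (i) identical identifiers
    if k == 0:
        return m[j][0] == 1       # (ii) m descends from n: after n iff on the right
    if k2 == 0:
        return n[j][0] == 0       # (iii) n descends from m: before m iff on the left
    return n[j] < m[j]            # (iv) siblings' (d, u) pairs, compared lexicographically


def compare(n, m):
    """Three-way comparison usable with functools.cmp_to_key."""
    return -1 if less(n, m) else (1 if less(m, n) else 0)
```

Sorting a collection of identifiers with sorted(ids, key=cmp_to_key(compare)) then yields the in-order traversal of the identifier tree.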

Experience. Two tree-based CRDTs designed for concurrent editing are Logoot and Treedoc,
differing in the details. Logoot [43] always allocates to the right, thus does not require
d. Treedoc [25, 32] groups sequential adds from the same source into a compact binary tree
with tombstones (no u part), and uses a sparse, unique tag for concurrent adds only.

If the tree is well balanced, the identifier size adjusts to the size of the sequence, and 
operations have logarithmic complexity. Experiments with text editing show that over time 
the tree becomes unbalanced. Rebalancing the tree is a kind of garbage collection, which 
we discuss in the next section. 
4 Garbage collection

Our practical experience with CRDTs shows that they tend to become inefficient over time, 
as tombstones accumulate and internal data structures become imbalanced [25, 32]. To 
avoid these issues, we investigate garbage collection (GC) mechanisms. Solving distributed 
GC would be difficult without synchronisation. We distinguish two kinds of GC problems, 
which differ by their liveness requirements. 

When these requirements are not met, GC may block. We consider this to be acceptable, 
as GC does not impact correctness (only performance), and the normal operations in the 
object’s interface remain live. 

GC issues concern both state- and op-based CRDTs. However, as CmRDTs hide some 
complexity by requiring stronger channels, this also affects GC. Indeed, reliable broadcast 
channels often implement GC mechanisms of their own. 

4.1 Stability problems 

An update f will sometimes add some information r(f) to the payload in order to deal
cleanly with operations concurrent with f. As an example, in the Add-Remove Partial
Order of Section 3.4.2, remove leaves a tombstone in order to allow addBetween operations to proceed.

Once f is stable, i.e., all operations concurrent with f have been delivered, r(f) serves
no useful purpose. A GC opportunity exists to detect this condition and discard r(f).

Definition 4.1 (Stability). Update $f$ is stable at replica $x_i$ (noted $\Phi_i(f)$) if all updates
concurrent to $f$ according to delivery order $<_d$ are already delivered at $x_i$. Formally,
$\Phi_i(f) \iff \forall j : f \in C(x_j) \land \nexists g \in C(x_j) \setminus C(x_i) : f \parallel_d g$.

Liveness of $\Phi$ requires that the set of replicas be known and that they not crash
permanently (undetectably). Under these assumptions, the stability algorithm of Wuu and
Bernstein [44] can be adapted. The algorithm assumes causal delivery. An update $g$ has an
associated vector clock $v(g)$. Replica $x_i$ maintains the last vector clock value received from
every other replica $x_j$, noted $V_i^{\min}(j)$, which identifies all updates that $x_i$ knows to have been
delivered by $x_j$. Each replica must periodically propagate its vector clock to update the $V_i^{\min}$ values,
possibly by sending empty messages. With this information,
$(\forall j : V_i^{\min}(j) \geq v(f)) \Rightarrow \Phi_i(f)$.
Importantly, the information required is typically already used by a reliable delivery
mechanism, and GC can be performed in the background.

For instance, our Add-Remove Partial Order data type from Section 3.4.2 could use $\Phi$
to remove tombstones left by remove once all concurrent addBetween updates have been
delivered. In the state-based emulation of Section 2.4.2, stable messages could be discarded
(this is Wuu’s original motivation). RGA also uses this approach (Section 3.5.1), as do
Treedoc and Logoot [32, 35, 43].
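
The stability test described above can be sketched as follows. This is a simplified illustration, not Wuu and Bernstein's algorithm as published: vector clocks are represented as dictionaries from replica name to counter, the names StabilityTracker, on_clock_received and is_stable are hypothetical, and the replica set passed to the constructor is assumed to be the full, known set (the local replica records its own clock like any other).

```python
def dominates(a, b):
    """Componentwise a >= b for vector clocks stored as dicts replica -> counter."""
    return all(a.get(replica, 0) >= count for replica, count in b.items())


class StabilityTracker:
    """Kept by one replica: the last vector clock known for every replica."""

    def __init__(self, replicas):
        self.last_seen = {r: {} for r in replicas}

    def on_clock_received(self, sender, clock):
        """Record a clock propagated by `sender` (possibly via an empty message)."""
        known = self.last_seen[sender]
        for replica, count in clock.items():
            known[replica] = max(known.get(replica, 0), count)

    def is_stable(self, update_clock):
        """True once every replica is known to have delivered the update."""
        return all(dominates(c, update_clock) for c in self.last_seen.values())
```

Once is_stable(v(f)) holds for a remove f, the tombstone it left behind can be discarded locally, in the background.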
Specification 21 Op-based Observed-Remove Shopping Cart (OR-Cart)

payload set S                                ▷ triplets { (isbn k, integer n, unique-tag u), ... }
  initial ∅
query get (isbn k) : integer n
  let N = {n' | (k', n', u') ∈ S ∧ k' = k}
  if N = ∅ then
    let n = 0
  else
    let n = Σ N
update add (isbn k, integer n)
  atSource (k, n)
    let α = unique()
    let R = {(k', n', u') ∈ S | k' = k}
  downstream (k, n, α, R)
    pre ∀(k, n, u) ∈ R : add(k, n, u) has been delivered     ▷ U-Set precondition
    S := (S \ R) ∪ {(k, n, α)}               ▷ Replace elements observed at source by new one
update remove (isbn k)
  atSource (k)
    let R = {(k', n', u') ∈ S | k' = k}
  downstream (R)
    pre ∀(k, n, u) ∈ R : add(k, n, u) has been delivered     ▷ U-Set precondition
    S := S \ R                               ▷ Downstream: remove elements observed at source


4.2 Commitment problems 

Some GC problems require a stronger form of synchronisation. One example is resetting
the payload across all replicas; for instance, safely removing an entry in a Counter (or in
a vector clock), removing tombstones from a 2P-Set (thus allowing deleted elements to be
added again), or rebalancing the tree in Treedoc [25]. To a first approximation, this requires
an atomic, unanimous agreement between all replicas, i.e., a commitment protocol such as
2-Phase Commit or Paxos Commit [15]. The set of replicas must be known, and liveness
requires that they all be reachable and responsive.

To overcome these strong requirements, Letia et al. [25] perform commitment only among a
small, stable subset of replicas, called the core. The other replicas asynchronously reconcile
their state with core replicas.


5 Putting CRDTs to work 

We now turn to a concrete example, maintaining shopping carts in an e-commerce bookstore.
A shopping cart must be always available for writes, despite failures or disconnection [10].
Data is replicated both within a data centre, for throughput,
and across several geographically-distant servers, for reliability. Given these assumptions,
linearisability would incur long response times; CRDTs provide an ideal solution.

5.1 Observed-remove Shopping Cart 

We define a shopping cart data type as a map from an ISBN number (a unique number 
representing the book edition) to an integer representing the number of units of the book 
the user wants to buy. Any of the Set abstractions presented earlier extends readily to a
Map; we choose to extend the OR-Set presented in Section 3.3.5, as it minimises anomalies.
An element is a (key, value) pair; concretely the key is a book ISBN (a unique product 
identifier), and the value is a number of copies. 

An op-based OR-Cart is presented in Spec. 21. The payload is a set of triplets (key,
value, unique-identifier), initially empty. Two update operations are defined. The add
operation adds a new, unique mapping from ISBN to value, which co-exists with existing mappings.

The remove operation removes all existing mappings for a given key. The source replica 
computes the set of triplets with the given key. Downstream, the update removes the triplets 
computed by the source from the downstream payload. The downstream precondition is the 
same as in 2P-Set and U-Set, namely, that the corresponding adds have been delivered; 
causal delivery is sufficient. 

To order a new book, or to increase the number of copies, the client should call add. To
cancel an order, the client should call remove. Checking out also calls remove. Decreasing
the number of copies requires first cancelling the existing order, then adding the number
required.

We now prove that OR-Cart is a CRDT by showing that concurrent updates commute.
Two adds commute, since each triplet is unique. Also, two removes commute, as the
downstream set-minus operations are either independent or idempotent. Operation add is 
independent of a concurrent remove, as its triplets are unique. 
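
As an illustration, Spec. 21 can be transcribed into Python roughly as follows. The sketch makes several assumptions of ours: the class and method names are hypothetical, unique tags come from uuid.uuid4, get sums over triplets (i.e., it treats N as a multiset), and causal delivery is left to the channel. It follows the downstream rule of Spec. 21 literally, so an add replaces the triplets for the same key that its source replica had observed, while triplets added concurrently are kept and summed by get.

```python
import uuid


class ORCart:
    def __init__(self):
        self.s = set()   # payload S: triplets (isbn, quantity, unique tag)

    def get(self, k):
        """Sum the quantities of all triplets for key k (0 if there are none)."""
        return sum(n for (k2, n, _) in self.s if k2 == k)

    def _observed(self, k):
        """The set R of triplets for key k currently visible at this replica."""
        return frozenset(t for t in self.s if t[0] == k)

    def add(self, k, n):
        """atSource: pick a unique tag, apply locally, return the op to broadcast."""
        op = (k, n, uuid.uuid4().hex, self._observed(k))
        self.apply_add(*op)
        return op

    def apply_add(self, k, n, tag, observed):
        """downstream: replace the triplets observed at the source by the new one."""
        self.s = (self.s - observed) | {(k, n, tag)}

    def remove(self, k):
        """atSource: compute R, apply locally, return R to broadcast."""
        observed = self._observed(k)
        self.apply_remove(observed)
        return observed

    def apply_remove(self, observed):
        """downstream: remove only the triplets observed at the source."""
        self.s = self.s - observed
```

If one replica calls remove(k) while another concurrently calls add(k, 1), delivering the two operations in either order leaves exactly the newly added triplet at both replicas, since the remove only names triplets that its source had observed.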

5.2 E-commerce bookstore 

Our e-commerce bookstore maintains the following information. Each user account has a 
separate OR-Cart. Assuming accounts are uniquely identified, the mapping from user to 
OR-Cart can be maintained by a U-Map, derived from U-Set in the obvious way. The 
shopping cart is created when the account is first created, and removed when it is deleted 
from the system. 

Let us assume a web interface to the shopping cart. When the user selects book b with
quantity q, the interface calls add(b, q). If the user increases the quantity to q', the interface
calls add(b, q' - q). To decrease the quantity to q', the interface calls remove(b) followed by
add(b, q'). If the user cancels the book, or brings the quantity to zero, the interface calls
remove(b).
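
The mapping from user actions to cart operations described above can be written down as a small helper. The function name cart_calls and the encoding of operations as tuples are our own, for illustration only; the function merely lists the calls the interface would issue and does not itself touch a cart.

```python
def cart_calls(isbn, old_qty, new_qty):
    """List the OR-Cart calls issued when a quantity changes from old_qty to new_qty."""
    if new_qty == old_qty:
        return []                                   # nothing to do
    if new_qty <= 0:
        return [("remove", isbn)]                   # cancel, or quantity brought to zero
    if new_qty > old_qty:
        return [("add", isbn, new_qty - old_qty)]   # first selection or increase
    return [("remove", isbn), ("add", isbn, new_qty)]  # decrease: cancel, then re-add
```

A first selection is simply the case old_qty = 0, which yields a single add(b, q).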

Assume a user calls some operation based on the observed state of his shopping cart.
Delivery order ensures that, for each product, the state of the shopping cart reflects the last
operation that the user observed. However, updates might be received by replicas in different
states, either because failures cause them to be out of sync, as reported by Amazon [10],
or when two users (e.g., family members) share the same account. In this case, although 
the state observed by the user may be stale, our approach minimises anomalies. Concurrent 
adds are merged as expected; a remove concurrent with an add will cancel the products 
already in the cart, but not those just added, which we believe is the cleanest semantics in 
this case. 

This design remains simple and does not incur the remove anomaly reported for Dynamo 
[10], and does not bear the cost of the version vector needed by Dynamo’s MV-Register 
approach. 


6 Comparison with previous work 

Eventual consistency has been a topic of research in highly-available, large-scale
asynchronous systems [37]. With the explosive growth of peer-to-peer, edge computing, grid
and cloud systems, eventual consistency has become an urgent issue for the industry [5, 41].
Contrary to much previous work [10, for instance], we take a formal approach grounded in 
the theory of commutativity and monotonic semilattices. However, we are far from being the 
first to study commutativity as a way to increase performance, availability, responsiveness, 
and to provide consistency at low cost. 

6.1 Commutativity in transactional systems 

Gray et al. show that reconciliation rate is a critical scalability factor for highly available
replicated database systems [14]. They find that transaction commutativity eases
reconciliation in such a setup. They do not assume that all concurrent operations commute, as
we do in this work: we are simplifying the reconciliation problem, but also limiting the
design space. Similarly, Helland and Campbell suggest using associative, commutative and
idempotent operations in order to tolerate transient faults, and to improve scalability and
availability [16].

Weihl designs high-performance concurrency control algorithms for a transactional ADT,
using commutativity to identify non-conflicting concurrent operations [42]. Weihl
distinguishes between forward and backward commutativity, which differ by how return values
and failures are handled. We believe this distinction is not relevant to our specifications,
where downstream operations do not return values and are never allowed to fail.

Klingemann et al. [23] build upon Weihl’s theory in a distributed cooperative
application framework that minimises reconciliation. Forward commutativity relations identify
conflicting operations, and backward commutativity identifies dependent operations.

6.2 Existing CRDTs

Previous work has designed commutative data types, without identifying the concept of a 
CRDT. 

Johnson and Thomas invented what we call LWW-Register [20]. They compose
multiple registers into a larger CRDT, a database of registers that can be created, updated
and deleted, using the LWW rule to arbitrate between concurrent assignments and removes
(i.e., a removed element can be recreated when necessary). LWW ensures a total order of
operations (without consensus) but this order is arbitrary and some updates are inherently
lost.

Wuu and Bernstein [44] describe two CRDTs that they call Dictionary and Log. Their 
Dictionary is a Map CmRDT, similar to our U-Set. It is built on top of a replicated Log of 
operations, which acts as a reliable epidemic broadcast channel; this inspired our state-based 
CmRDT emulation. The main focus of their article is ensuring effective log propagation and 
pruning (as described in Section 4.1), to alleviate unbounded growth of the log. 

Collaborative editing is an area where commutativity has been used (often implicitly)
to provide users with high responsiveness even in disconnected operation. Thus, Operational
Transformation attempts to achieve commutativity after the fact [26]. WOOT is an early
CRDT designed for collaborative editing [30], followed up with Logoot [43]. The first two
authors of this paper invented the CRDT concept when working on the Treedoc data structure
for collaborative editing [25, 32].

This work exposed the issue of garbage collection in CRDTs. In order to cope with the
lack of liveness of GC in the presence of faults, Letia et al. suggest moving it into a subset
of stable replicas and reconciling with other replicas asynchronously [25].

Weiss et al. designed the sequential buffer CRDT Logoot, extended with a general-purpose
undo mechanism based on a PN-Counter [43]. This approach suffers from anomalies
when the counter goes negative. Using our OR-Set can improve tracking causality of
visibility-related operations. Martin et al. generalize Logoot to a CRDT maintaining an
XML document [27]. This is a notable real-world application of CRDT composition.

Dynamo is an example of a production key-value store built for availability [10]. Dynamo 
uses the CRDT technique that we call MV-Register in this paper. Used for Amazon’s 
shopping cart service, Dynamo exposes the anomalies of MV-Register. We propose herein 
to use one of our Set types instead in order to ensure clean semantics. 

6.3 Commutativity-oriented design 

Some previous work already focused on commutativity or semilattices for eventual
consistency.

The foundations of CvRDTs were introduced by Baquero and Moura [2, 3]. This paper
extends their work with a specification language, by considering CmRDTs, by studying more
complex examples, and by considering GC.

Roh et al. [35, 36] independently developed the Replicated Abstract Data Type
concept, which is quite similar to CRDT. They generalise LWW to a generic partial order of
operations, called precedence transitivity, which they leverage to build several LWW-style 
classes. They present the RGA replicated sequence for co-operative editing. The current 
work considers a larger design space, as we allow any merge function that computes a LUB. 
We formalise Roh’s observation that causal delivery is not always strictly necessary with 
downstream preconditions. Roh addresses the GC issue with Wuu and Bernstein’s stability- 
detection algorithm. 

Ellis and Gibbs’ [12] Operational Transformation (OT) studies sequences for shared 
editing designed as op-based objects. Operations are not commutative by design; however, 
a replica receiving an operation transforms it against previously-received concurrent updates. 
The concurrent editing community has studied OT intensively, and many OT algorithms 
have been proposed. However, Oster et al. demonstrate that most OT algorithms for a 
decentralized OT architecture are incorrect [29]. We believe that designing data types for 
commutativity is both cleaner and simpler. 

Dennis et al. propose to verify commutativity using a declarative modeling language
[11]. They were able to detect non-commutativity between operations on a particular ADT.
However, the absence of counterexamples found by the tool does not guarantee commutativity.
They appear to assume a synchronous system model.

Alvaro et al.’s so-called CALM approach ensures eventual consistency by enforcing a
monotonic logic [1]. This is somewhat similar to our rule for CvRDTs, that every update or
merge operation move forward in the monotonic semilattice. Their Bloom domain-specific
language comes with a static analysis tool that analyses program flow and identifies
non-monotonicity points, which require synchronization. This approach encourages
programmers to write monotonic programs and makes them aware of synchronization requirements.
Monotonic logic is more restrictive than our monotonic semilattice. Thus, Bloom does not
support remove without synchronisation.

6.4 Exploiting good connectivity for stronger consistency 

Although eventual consistency ensures availability when a system partitions, Serafini et al.
suggest leveraging periods of good network conditions to achieve the stronger and more
desirable linearisability property [39]. They define weak operations, ones that need only
be eventually linearized. They show that it is impossible to build such a shared object using
the ◇S failure detector, if one requires that all operations terminate even in the presence of
failures. In future work, we plan to add small doses of synchronous operations, for instance
to commit a result; it will be interesting to study the impact of Serafini’s results on such
designs.

7 Conclusion

We presented the concept of a CRDT, a replicated data type for which some simple
mathematical properties guarantee eventual consistency. In the state-based style, the successive
states of an object should form a monotonic semilattice and replica merge should compute a least
upper bound. In the op-based style, concurrent operations should commute.

State-based objects require only eventual communication between pairs of replicas.
Op-based replication requires reliable broadcast communication with delivery in a well-defined
delivery order. Both styles of CRDTs are guaranteed to converge towards a common, correct 
state, without requiring any synchronisation. 

We specified a number of interesting CRDTs, in a high-level specification language for 
asynchronous replication based on simple logic. In particular, we focused on container 
types with clean semantics for add and remove operations. The Set is the basic container, 
from which we derive Maps, Graphs, and Sequences. To alleviate unbounded growth and
imbalance, garbage collection can be performed using a weak form of synchronisation, off
the critical path of client-level operations.

Eventual consistency is a critical technique in many large-scale distributed systems,
including delay-tolerant networks, sensor networks, peer-to-peer networks, collaborative
computing, cloud computing, and so on. However, work on eventual consistency has so far
been mostly ad hoc. Although some of our CRDTs were known before in the literature or in the
folklore, this is the first work to engage in a systematic study. We believe this is required if 
eventual consistency is to gain a solid theoretical and practical foundation. 

Future work is both theoretical and practical. On the theory side, this will include
understanding the class of computations that can be accomplished by CRDTs, the complexity
classes of CRDTs, the classes of invariants that can be supported by a CRDT, the relations
between CRDTs and concepts such as self-stabilisation and aggregation, and so on. On the
practical side, we plan to implement the data types specified herein as a library, to use them
in practical applications, and to evaluate their performance experimentally. Another
direction is to study adding small doses of synchronisation to support infrequent, non-critical
client operations, such as committing a state or performing a global reset. We will also look 
into stronger global invariants, possibly using probabilistic or heuristic techniques. 